NetDiagnostics – DR HA Guidelines

Disaster Recovery

Cavisson takes the availability and performance of its products very seriously. We have created recovery guidelines to help customers recover from any event that causes the NetDiagnostics Enterprise Server to fail.

Agents typically do not fail on their own. Because they reside on the same application servers as the application under test (AUT) or the application being monitored, they fail only when the monitored application server itself fails. We therefore assume that agents follow the business continuity strategy the customer has already set for that application.

The NetDiagnostics appliance is designed to be physically close to the agents because it includes the controller that serves them. By design, the Active NetDiagnostics Server is therefore hosted in the same data center as the application being monitored. Refer to Figure 1.

Figure 1: A typical NetDiagnostics setup

As a best practice, Cavisson recommends setting up a backup server that is an exact replica of the active NetDiagnostics Server (with similar configuration, controller settings, database, profiles, etc.). This backup server is up and running but works only passively: it syncs all data and recent configuration files from the active NetDiagnostics Server at a configured sync cycle time. The agents keep sending monitoring data to the active NetDiagnostics Server only. Refer to Figure 2.

Process for Setting-up Backup Server

The following is a high-level process for setting up a NetDiagnostics backup server:

Approach 1 – Setting up 1 Active NetDiagnostics: 1 Backup Server
  1. A complete copy of the protected hardware is needed in the same data center. This will be an exact replica of the active NetDiagnostics Server (including controllers, database, and all configuration files) already installed in that data center.
  2. Approved Backup / DR licenses, which allow a second system to run while agents report only to the active NetDiagnostics (ND) Server, need to be provided to the backup server.
  3. Disk syncing (configurable by data and days) from the active ND Server to the backup ND Server is set up for items such as the session store, config files, database, and caches.
  4. A VIP (virtual IP) is to be implemented that can be switched from the Active ND Server to the Backup ND Server when a failover is required.
  5. For a DR strategy across data centers, additional backup ND Server archiving can be hosted in a separate data center. This is entirely optional.

Figure 2: Backup NetDiagnostics Server within the same data center
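
The disk-syncing step can be pictured with a small sketch. It assumes that "configurable by data and days" means choosing which kinds of data to sync and how many days of data to keep in sync; the class names, category names, and file paths below are hypothetical and are not part of NetDiagnostics.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class SyncPolicy:
    categories: frozenset  # kinds of data to sync (hypothetical names)
    days: int              # how many days of data to keep in sync


@dataclass(frozen=True)
class Item:
    path: str
    category: str
    modified: datetime


def items_to_sync(items, policy, now):
    """Return paths whose category is selected and whose age is within the window."""
    cutoff = now - timedelta(days=policy.days)
    return [
        item.path
        for item in items
        if item.category in policy.categories and item.modified >= cutoff
    ]


now = datetime(2024, 1, 31)
policy = SyncPolicy(categories=frozenset({"config", "database"}), days=7)
items = [
    Item("conf/controller.cfg", "config", datetime(2024, 1, 30)),
    Item("db/metrics.dump", "database", datetime(2024, 1, 10)),
    Item("cache/tmp.bin", "cache", datetime(2024, 1, 31)),
]
print(items_to_sync(items, policy, now))  # ['conf/controller.cfg']
```

A wider window or more categories increases the amount of data available after a failover, at the cost of more disk and sync traffic on the backup server.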

Approach 2 – Multiple Active NetDiagnostics: 1 Backup Server

This approach is recommended to reduce the hardware cost of an environment that hosts multiple NetDiagnostics Servers.

  1. For four or fewer NetDiagnostics Servers within the same environment, one backup server is needed in the same data center. This backup server will store the configurations of all the active servers (including controllers, database, and all configuration files) already installed in that data center.
  2. Approved Backup / DR licenses, which allow this second system to run while agents report only to the active NetDiagnostics (NDE) Server, need to be provided to the backup server.
  3. In the many-to-one approach, the backup server keeps only limited data (the data the customer defines as required or critical), in addition to the configuration information of all the active NetDiagnostics Servers.
  4. A VIP (virtual IP) is to be implemented that can be switched from any Active ND Server to the Backup ND Server when a failover is required.
  5. For a DR strategy across data centers, additional backup ND Server archiving can be hosted in a separate data center. This is entirely optional.

Figure 3: Many to One Backup Server Approach for NetDiagnostics
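
As a rough illustration of the many-to-one approach, the sketch below models a shared backup that holds the full configuration of up to four active servers but keeps only the data categories marked critical. The `SharedBackup` class, its methods, and the category names are hypothetical. The guideline does not say how to size beyond four active servers, so the sketch simply refuses a fifth.

```python
MAX_ACTIVE_PER_BACKUP = 4  # from the guideline: four or fewer active servers


class SharedBackup:
    """Hypothetical model of one backup server shared by several active servers."""

    def __init__(self, critical_categories):
        self.critical = set(critical_categories)
        self.configs = {}  # server name -> full configuration copy
        self.data = {}     # server name -> {category: payload}

    def register(self, server, config):
        """Store the full configuration of an active server."""
        is_new = server not in self.configs
        if is_new and len(self.configs) >= MAX_ACTIVE_PER_BACKUP:
            raise ValueError("guideline covers at most four active servers per backup")
        self.configs[server] = dict(config)
        self.data.setdefault(server, {})

    def store(self, server, category, payload):
        """Keep data only if the customer marked its category as critical."""
        if server not in self.configs:
            raise KeyError(server)
        if category not in self.critical:
            return False
        self.data[server][category] = payload
        return True


backup = SharedBackup(critical_categories={"alerts"})
for name in ["nd1", "nd2", "nd3", "nd4"]:
    backup.register(name, {"controller": name})
backup.store("nd1", "alerts", "payload-a")      # kept
backup.store("nd1", "raw_traces", "payload-b")  # dropped, not critical
```

Keeping the configuration complete while trimming the data is what lets a single machine stand in for any of the active servers.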

High Availability

All Cavisson machines are enterprise-class machines with high-end configurations, tested in labs as well as in the toughest situations in our customers’ environments. Even so, we have ensured maximum availability of these machines by addressing the rare possibility of unforeseen issues related to:

Network Availability: All Cavisson appliances come with two fiber-optic channels (for ND data) and two copper-based network interfaces (for UI data) to ensure maximum network availability if the primary network interface fails.

Power Source: All Cavisson appliances come with provision for an alternate power source to ensure maximum availability if the primary power source fails.

Failover and Recovery

Based on business criticality, Cavisson recommends the following failover and recovery mechanisms.

ND Failover

A failure is detected when the Local Traffic Manager (LTM) Load Balancer does not receive a response from the active NetDiagnostics Server. In that case, the LTM Load Balancer points to the Backup ND Server, which takes over and resumes the monitoring.

This mechanism offers the fastest recovery with minimal loss of data. With the help of the LTM Load Balancer, the Backup ND Server takes over the active position as soon as the active NetDiagnostics appliance goes down. To resume monitoring, the session needs to be activated to bring the Backup NetDiagnostics Server into full Active mode.

The Backup ND Server picks up almost where the previous active server left off, except for the few minutes the LTM Load Balancer needs to detect the failure. The last synced data (per configuration) is available immediately.

Figure 4: Failover of NetDiagnostics Server
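
The detect-and-switch behaviour described above can be sketched as follows. This is not the load balancer's actual logic: the `VipFailover` class, the injected `probe` callable, and the `max_misses` threshold are assumptions made for illustration. The guideline only states that a missing response triggers the switch and that the session must then be activated.

```python
class VipFailover:
    """Hypothetical model of a VIP that moves from the active to the backup server."""

    def __init__(self, active, backup, probe, max_misses=3):
        self.target = active          # server the VIP currently points to
        self.standby = backup         # server to switch to on failure
        self.probe = probe            # callable(server) -> True if it responds
        self.max_misses = max_misses  # assumed number of missed responses tolerated
        self.misses = 0
        self.session_active = True

    def poll(self):
        """Run one health check and switch the VIP if the target keeps failing."""
        if self.probe(self.target):
            self.misses = 0
            return self.target
        self.misses += 1
        if self.misses >= self.max_misses and self.standby is not None:
            self.target, self.standby = self.standby, None
            self.misses = 0
            # Monitoring resumes only after the session is activated.
            self.session_active = False
        return self.target

    def activate_session(self):
        """Bring the newly targeted server into full Active mode."""
        self.session_active = True


down = {"nd-active"}
vip = VipFailover("nd-active", "nd-backup", probe=lambda s: s not in down)
for _ in range(3):
    vip.poll()
print(vip.target, vip.session_active)  # nd-backup False
vip.activate_session()
```

The interval between polls multiplied by the miss threshold corresponds to the few minutes of detection delay mentioned above.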

Zone Failover

This is a failover approach for infrastructure within the same data center. A Zone is a replica of the active infrastructure setup.

Figure 5: Zone Failover Approach / Setup

A Zone failure is detected when the Local Traffic Manager Load Balancer does not receive a response from the Zone (including its active and backup ND Servers). In that scenario, the failed Zone becomes inactive, and the Local Traffic Manager Load Balancer points to the inactive (backup) Zone and makes it active.

Within the Zone, the Active and Inactive (backup) ND arrangement works as described in the ND Failover section. To resume monitoring, the session needs to be activated to bring the NetDiagnostics Server within the newly activated Zone into full Active mode.

The Active ND Server in the new Zone picks up almost where the previous active server left off, except for the few minutes the LTM Load Balancer needs to detect the failure. The last synced data (per configuration) is available immediately.

Figure 6: Failover of Zone
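
A zone-level decision could be sketched like this. It assumes a Zone counts as failed only when neither its active nor its backup ND Server responds, which is one reading of the description above. The function names, zone names, and server names are hypothetical.

```python
def zone_responds(servers, probe):
    """A zone is treated as up if at least one of its ND servers responds."""
    return any(probe(server) for server in servers)


def select_zone(zones, current, probe):
    """Return the zone that traffic should point to after one health check."""
    if zone_responds(zones[current], probe):
        return current
    for name, servers in zones.items():
        if name != current and zone_responds(servers, probe):
            return name
    return current  # no healthy zone found; leave routing unchanged


zones = {
    "zone-a": ["a-active", "a-backup"],
    "zone-b": ["b-active", "b-backup"],
}
down = {"a-active", "a-backup"}
print(select_zone(zones, "zone-a", lambda s: s not in down))  # zone-b
```

If only `a-active` were down, the function would keep `zone-a`, and the switch to `a-backup` would be handled by the ND failover inside the zone.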

Data Failover Service Level Agreements

The availability of data in a failover process depends entirely on the backup and recovery configuration, along with the database sync time applied during the backup server implementation.

For previously monitored or captured data, Cavisson maintains the following SLAs as standard policy. There are two types of data: Metrics and Diagnostics.

Availability of Metrics Data:

All Data – 1 year

Event Days data – 3 years

Aggregated Data (hourly as standard policy*) – 3 years

*Configurable so that appropriate sizing of appliance / disk can be done.

Availability of Diagnostics Data:

Online Data – 30 days

Offline (archived) Data – 90 days

Event Days (archived) – 3 years

The time taken to restore data depends on the restoration mechanism and the availability of the backup / archived data.
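
The retention periods listed above can be captured in a small lookup, shown here as a sketch. The tier keys and the `is_retained` helper are hypothetical names, and a year is approximated as 365 days, which is an assumption rather than a stated policy.

```python
DAYS_PER_YEAR = 365  # assumption: the SLA does not define the length of a year

RETENTION_DAYS = {
    ("metrics", "all"): 1 * DAYS_PER_YEAR,
    ("metrics", "event_days"): 3 * DAYS_PER_YEAR,
    ("metrics", "aggregated"): 3 * DAYS_PER_YEAR,  # hourly by default, configurable
    ("diagnostics", "online"): 30,
    ("diagnostics", "offline"): 90,  # archived
    ("diagnostics", "event_days"): 3 * DAYS_PER_YEAR,  # archived
}


def is_retained(data_type, tier, age_days):
    """Return True if data of this type and tier is still within its retention period."""
    if age_days < 0:
        raise ValueError("age_days must not be negative")
    return age_days <= RETENTION_DAYS[(data_type, tier)]


print(is_retained("diagnostics", "online", 45))   # False
print(is_retained("diagnostics", "offline", 45))  # True
print(is_retained("metrics", "all", 400))         # False
```

For example, diagnostics data that is 45 days old is past the online window but is still covered by the offline (archived) tier.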