Cavisson DR and HA Objective
The objective of DR is to ensure that continuous monitoring is not interrupted by a hardware, software, or network failure.
The DR and HA strategy focuses primarily on two components:
- Cavisson Agent
- Cavisson Server
Agents typically do not fail on their own. Because agents reside on the application servers themselves, an agent fails only when the application server being monitored fails. We therefore assume that agents follow the business continuity strategy that the customer has already set for the Application Under Test (AUT) or the application to be monitored. The following sections describe the DR and HA strategy and implementation details for the Cavisson Server, in this case Cavisson NetDiagnostics Enterprise (NDE).
The NetDiagnostics Enterprise appliance should be physically close to the agents because the appliance includes a controller that serves the agents. By design, therefore, the Active NetDiagnostics Server is hosted in the same data center as the application being monitored.
Below are the hardware, software, and network requirements for DR and HA implementation:
- DR is implemented using two NDE servers.
- The first server is the master (active) and the second is the backup (passive).
- Both NDE servers should have the same standard configuration.
(Please refer to the standard NDE server configuration)
- To use a Virtual IP (VIP), both NDE servers should be on the same subnet; otherwise, DNS is used.
- The backup server should have sufficient memory and disk space for monitoring. The actual requirement varies from setup to setup.
- Both NDE servers should be connected through a fiber optic interface.
- A purge policy should be implemented on both NDE servers to remove unwanted reports, logs, files, and so on.
- Both the master and the backup servers should have the same version of the Keepalived daemon installed and running. The supported Keepalived version is 4.1.
- Both the master and the backup servers should be running the same Cavisson release. The DR and HA strategy is available only for Cavisson Release 4.1.10 #31 and above.
- Both NDE servers should be running Ubuntu 16.
- Connectivity: For the backup, the firewall should be opened from the application server to the backup appliance.
HA is achieved using Keepalived, which uses the Virtual Router Redundancy Protocol (VRRP). Data is synced regularly from the master to the backup using rsync. Both NDE servers should have a virtual IP.
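As a rough illustration of the master-to-backup sync described above, the sketch below builds an rsync command line in Python. The directory paths, the `cavisson` user name, and the choice of the `-a` (archive) and `-z` (compress) options are assumptions made for illustration; the actual sync job and its options are not specified here.

```python
import shlex
from typing import List


def build_rsync_command(source_dir: str, backup_host: str, dest_dir: str,
                        user: str = "cavisson") -> List[str]:
    """Return an rsync argv that copies source_dir on the master to the backup.

    The user name and directories are hypothetical placeholders.
    """
    # -a preserves permissions, timestamps, and symlinks; -z compresses in transit.
    # A trailing slash on the source copies the directory contents rather than
    # the directory itself.
    src = source_dir.rstrip("/") + "/"
    dst = f"{user}@{backup_host}:{dest_dir.rstrip('/')}/"
    return ["rsync", "-az", src, dst]


if __name__ == "__main__":
    # 192.0.2.20 is a documentation-only address standing in for the backup NDE.
    cmd = build_rsync_command("/data/nde", "192.0.2.20", "/data/nde")
    print(shlex.join(cmd))
```

Returning the command as a list keeps it safe to pass to `subprocess.run` without shell quoting issues; scheduling the sync "regularly" would be handled separately, for example by cron.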
Scenario – 1: One-on-One Master/Backup Mapping
In this scenario, there is one Master NDE (NDBox1) and one Backup NDE (NDBox2). The Virtual IP is assigned to the Master (Active NDE) server. If the Master fails, its Virtual IP is automatically assigned to the Backup machine.
Scenario – 2: Cross Mapping
As illustrated in the image, the existing Active NetDiagnostics Server box (ND Server 1) is leveraged to host a Backup NetDiagnostics Server for another ND server (ND Server 2). This is done by creating a separate VM on another controller of that server. NDBox1 can then host the backup for the Active ND Server running on NDBox2. Similarly, NDBox2 hosts the backup of the Active ND Server running on NDBox1.
The application/agent runs on both machines. Each machine has two controllers: the first controller is the Master and the second controller is the Backup.
- Controller -1 (Master) of NDBox1 is connected with Controller -2 (Backup) of NDBox2.
- Controller -1 (Master) of NDBox2 is connected with Controller -2 (Backup) of NDBox1.
In this scenario, one virtual IP is needed for each physical machine (for master and for backup).
Failover / Switching Backup to Master
If continuous monitoring is interrupted because the Master (Active) server is unavailable, the backup collector on the corresponding machine automatically takes over as the Master server, becomes active, and resumes continuous monitoring.
Before starting the test, sync the primary and the backup controllers. The first sync may take some time, depending on the size of the data on the primary controller.
NDE health can be marked as good or critical based on multiple parameters, as follows:
- Machine load average
- Disk usage
- CPU usage
- Issue with any of the NDE processes
- Network issues
If the health of the NDE is critical for n consecutive intervals, failover occurs.
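The "n consecutive critical intervals" rule can be sketched as a small counter that resets whenever a healthy interval is seen. This is a minimal illustration, not the NDE implementation; the class name `FailoverDetector` and the parameter name `threshold` (standing in for n) are hypothetical, and how each interval is judged critical (load average, disk, CPU, process, or network checks) is left outside the sketch.

```python
from dataclasses import dataclass


@dataclass
class FailoverDetector:
    """Counts consecutive critical health intervals and signals failover."""
    threshold: int                 # the "n" from the text (hypothetical name)
    consecutive_critical: int = 0  # length of the current critical streak

    def record(self, is_critical: bool) -> bool:
        """Record one interval's health; return True once failover should occur."""
        if is_critical:
            self.consecutive_critical += 1
        else:
            # Any good interval resets the streak.
            self.consecutive_critical = 0
        return self.consecutive_critical >= self.threshold


if __name__ == "__main__":
    detector = FailoverDetector(threshold=3)
    samples = [True, True, False, True, True, True]
    for interval, critical in enumerate(samples, start=1):
        if detector.record(critical):
            print(f"failover triggered at interval {interval}")
            break
```

Requiring consecutive critical intervals, rather than reacting to a single bad reading, avoids switching to the backup on a short-lived spike.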
The agent configuration should include the IP address of the backup ND server. If the health of NDBox1 becomes critical, all agents close their connection with NDBox1 and connect to NDBox2. Monitoring data then starts arriving on that NDE machine.
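One way to picture the agent-side switch is a loop that retries the primary a fixed number of times and then moves on to the backup. The sketch below assumes that ordering (primary first, then backups in the order listed) and assumes that the retry count applies per endpoint; neither is confirmed by this document. The function name, the `try_connect` callback, and the host names are hypothetical, and the defaults of 3 retries and 60 seconds mirror the retryPolicy defaults given in the agent configuration.

```python
import time
from typing import Callable, List, Optional, Tuple

Endpoint = Tuple[str, int]  # (host, port)


def connect_with_failover(
    endpoints: List[Endpoint],
    try_connect: Callable[[Endpoint], bool],
    retry_count: int = 3,
    retry_interval_seconds: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[Endpoint]:
    """Try each endpoint in order; return the first one that accepts a connection.

    Returns None if every endpoint fails retry_count times.
    """
    for endpoint in endpoints:
        for attempt in range(1, retry_count + 1):
            if try_connect(endpoint):
                return endpoint
            # Wait before the next attempt, but not after the last one.
            if attempt < retry_count:
                sleep(retry_interval_seconds)
    return None


if __name__ == "__main__":
    # Simulated connectivity: NDBox1 is down, NDBox2 is reachable.
    reachable = {"ndbox2"}
    chosen = connect_with_failover(
        [("ndbox1", 7892), ("ndbox2", 7892)],
        try_connect=lambda ep: ep[0] in reachable,
        retry_interval_seconds=0,
    )
    print(chosen)
```

Passing `try_connect` and `sleep` as parameters keeps the retry logic testable without opening real sockets or waiting for real intervals.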
Below are the details for application agent configuration, machine agent configuration, and NetDiagnostics collector configuration.
Application Agent Configuration
The following needs to be added to the ndsetting.conf file:
- tier=<The value is an alphanumeric string>
- server=<The value is an alphanumeric string>
- instance=<The value is an alphanumeric string>
- ndcHost=<The value is either an IPv4 address in dotted-decimal notation or an alphanumeric string>
- backupNdcHostAndPort=<IP1 or host name>:<Port>;<IP2 or host name>:<Port>
- retryPolicy=<retry count> <retry interval in seconds>. The default retry count is 3 and the default retry interval is 60 seconds.
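To make the retryPolicy format concrete, the following sketch reads key=value settings and extracts the retry count and interval, falling back to the defaults of 3 and 60 seconds when the key is absent. It assumes one key=value pair per line and a single space between the two retryPolicy fields; using the default interval when only a count is given is also an assumption. The sample values and the names `RetryPolicy`, `parse_settings`, and `parse_retry_policy` are illustrative only.

```python
from typing import Dict, NamedTuple


class RetryPolicy(NamedTuple):
    count: int
    interval_seconds: int


# Defaults stated in the text: 3 retries, 60 seconds apart.
DEFAULT_RETRY_POLICY = RetryPolicy(count=3, interval_seconds=60)


def parse_settings(text: str) -> Dict[str, str]:
    """Parse key=value lines, skipping blank lines and lines without '='."""
    settings: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings


def parse_retry_policy(settings: Dict[str, str]) -> RetryPolicy:
    """Read retryPolicy as '<count> <interval>', using defaults where missing."""
    parts = settings.get("retryPolicy", "").split()
    if not parts:
        return DEFAULT_RETRY_POLICY
    count = int(parts[0])
    # Assumption: a missing interval falls back to the default interval.
    interval = int(parts[1]) if len(parts) > 1 else DEFAULT_RETRY_POLICY.interval_seconds
    return RetryPolicy(count, interval)


if __name__ == "__main__":
    sample = """
    tier=AppTier
    server=AppServer1
    instance=Instance1
    ndcHost=192.0.2.10
    backupNdcHostAndPort=192.0.2.20:7892;192.0.2.21:7892
    retryPolicy=5 30
    """
    print(parse_retry_policy(parse_settings(sample)))
```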
Machine Agent Configuration
Similarly, for the machine agent, the following needs to be entered in the cmon.env file:
- backupNdcHostAndPort=<IP1 or host name>:<Port>;<IP2 or host name>:<Port>. This field is optional. However, if it is defined, the host name is mandatory. The port is optional and defaults to 7892.
- retryPolicy=<retry count> <retry interval in seconds>. The default retry count is 3 and the default retry interval is 60 seconds.
The machine agent settings above are additional arguments required in cmon.env.
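The backupNdcHostAndPort rules (semicolon-separated entries, mandatory host, optional port defaulting to 7892) can be expressed as a short parser. This is a sketch of the format as described here, not the agent's own parsing code; the function name `parse_backup_hosts` is hypothetical, and only IPv4 addresses or host names are handled, since those are the forms the format lists.

```python
from typing import List, Tuple

DEFAULT_NDC_PORT = 7892  # default port stated for the machine agent


def parse_backup_hosts(value: str) -> List[Tuple[str, int]]:
    """Parse '<host>[:<port>];<host>[:<port>]' into a list of (host, port)."""
    endpoints: List[Tuple[str, int]] = []
    for entry in value.split(";"):
        entry = entry.strip()
        if not entry:
            continue  # tolerate empty input or a trailing ';'
        host, _, port = entry.partition(":")
        host = host.strip()
        if not host:
            # The host name is mandatory whenever the field is defined.
            raise ValueError(f"host name is mandatory in entry {entry!r}")
        port = port.strip()
        endpoints.append((host, int(port) if port else DEFAULT_NDC_PORT))
    return endpoints


if __name__ == "__main__":
    # The second entry omits the port, so the default is applied.
    print(parse_backup_hosts("192.0.2.20:7000;nde-backup"))
```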
NetDiagnostics Collector Configuration
Two additional keywords must be present in the ndc.conf (configuration) file:
- ND_DR_RETRY_POLICY (specifies the retry count policy to be followed).
- ND_DR_STATE (specifies whether DR is enabled or disabled).
Data Failover Service Level Agreements
The availability of data in a failover process depends entirely on the backup and recovery configuration, along with the database sync time applied during the backup server implementation.
For previously monitored or captured data, Cavisson maintains the following SLAs as standard policy. There are two types of data: Metrics and Diagnostics.
Availability of Metrics Data
- All Data – 1 year
- Event Days data – 3 years
- Aggregated Data (hourly as standard policy*) – 3 years
Note: It is configurable so that appropriate sizing of appliance / disk can be done.
Availability of Diagnostics Data
- Online Data – 30 days
- Offline (archived) Data – 90 days
- Event Days (archived) – 3 years
Note: Time taken to restore data is dependent on the restoration mechanism and availability of the back-up / archived data.
Verify Successful DR / HA
To verify a successful DR / HA implementation, follow these steps:
- Log in to the ND UI.
- Check that dashboard data is displayed properly.
- Check that favorites and templates are synced properly.
- Check that alert rules, baselines, and policies are implemented properly.
- Perform a switchover test.
- Check again that dashboard data is displayed properly after the switchover.