NetHavoc Overview

The overall performance of a service depends, among other things, on its ability to tolerate failures. This aspect of an application can be tested by deliberately injecting random faults and failures into its infrastructure.

NetHavoc is a NetStorm feature that lets users test the resilience of their applications. It can inject various faults into the application infrastructure during a load test, and the after-effects of the fault injection can be monitored through NetStorm's monitoring capabilities.

Types of Faults

Various faults can occur in an application while it runs. These include:

  • Instance(s) or server(s) down
  • Process termination
  • Server ill health (high CPU or memory usage, low disk space, disk failures, etc.)
  • Network outage or slowness

Key Features

  • Faults can be injected randomly in the staging and/or production environment (during a load test or even in live production), and the after-effects monitored using the NDE infrastructure (Failure as a Service).
  • Faults can be injected by the Fault Injection software based on different parameters including:
    • Time (off peak hours)
    • Probability (of the fault occurring)
    • Spacing (between two faults)
    • Severity (instance(s), server(s), Tier(s), DC going down)
    • Partial fault (disable network interface) to full fault (server power down)
  • Faults can be injected in different services:
    • Application servers
    • Load balancers
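
The scheduling parameters listed above (time window, probability, spacing) can be illustrated with a small decision function. This is a minimal sketch under stated assumptions: `FaultPolicy`, `in_window`, `should_inject` and `plan_faults` are hypothetical names, not part of NetHavoc, and the code only models how such parameters could combine when deciding whether a candidate time slot receives a fault.

```python
import random
from dataclasses import dataclass
from datetime import datetime, time, timedelta
from typing import List, Optional


@dataclass
class FaultPolicy:
    """Hypothetical scheduling policy; field names are illustrative only."""
    window_start: time       # start of the off-peak window
    window_end: time         # end of the off-peak window (may wrap past midnight)
    probability: float       # chance (0.0-1.0) that an eligible slot gets a fault
    min_spacing: timedelta   # minimum gap between two injected faults


def in_window(t: time, start: time, end: time) -> bool:
    """True if t falls in [start, end); also handles windows such as 22:00-02:00."""
    if start <= end:
        return start <= t < end
    return t >= start or t < end


def should_inject(now: datetime, last_fault: Optional[datetime],
                  policy: FaultPolicy, rng: random.Random) -> bool:
    if not in_window(now.time(), policy.window_start, policy.window_end):
        return False  # outside off-peak hours
    if last_fault is not None and now - last_fault < policy.min_spacing:
        return False  # too soon after the previous fault
    return rng.random() < policy.probability  # probabilistic draw


def plan_faults(day_start: datetime, slot: timedelta, policy: FaultPolicy,
                rng: random.Random) -> List[datetime]:
    """Walk one day in fixed slots and collect the times that receive a fault."""
    planned: List[datetime] = []
    last: Optional[datetime] = None
    t = day_start
    while t < day_start + timedelta(days=1):
        if should_inject(t, last, policy, rng):
            planned.append(t)
            last = t
        t += slot
    return planned
```

With a 01:00-05:00 window, probability 1.0 and one-hour spacing, checking every 15 minutes over one day yields faults at 01:00, 02:00, 03:00 and 04:00; lowering the probability thins that schedule out at random.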

Injecting Faults

A user can inject faults into any running instance or server at a specified time or at a random time; in the random case, NetHavoc simulates the failure at a random point within a given time interval. The user provides the inputs required by each fault type, and the system calls the corresponding APIs to inject the specified fault into the running instance or server.

Follow the steps below to inject faults and analyze the application's behavior:

  1. On the product UI home page, click the NetHavoc icon in the left menu.

  2. If a test is currently running, the NetHavoc window is displayed. Otherwise, a message indicates that no test is running and that faults cannot be configured.

  3. The NetHavoc window contains two sections: Fault Injection Configuration (where fault injection is configured) and Fault Injection Summary (where the configured faults are listed and can be applied for resilience testing).

Fault Injection Configuration

  1. Select the tier, server, and connection from the corresponding drop-down lists.
  2. Select one of the available fault types and provide the corresponding details:
  • Server Termination: With this fault type, the user can terminate a server hosted on Azure. The user needs to provide the following details:
    • Cloud Type: The cloud type (Azure).
    • Computer Name: The cloud server's name or IP address.
    • Resource Group Name: The Azure resource group the server belongs to. A resource group is a logical container for a collection of resources that can be treated as one logical instance, so the user can control all of its members collectively.
    • User Name / Password: The server credentials.
  • High CPU Load: NetHavoc burns CPU on the selected server(s). The default value is 10% and the maximum is 80%.
    • CPU Burn %: The percentage of CPU to burn on the server. It must be greater than the CPU utilization already present on the server. Range: 0-80%.
  • Low Memory: NetHavoc increases memory utilization by decreasing the available memory on the file system(s) mounted on the selected server(s). The default value is 10% and the maximum is 80% memory utilization.
    • Partition: The partition on which the server's file system is mounted.
    • Total Target Memory: The total amount of memory to be consumed on the server, i.e., the memory currently utilized plus the additional memory to be consumed. Range: 0-80%.
  • Application Termination: NetHavoc can terminate any application on the server(s), identified by process ID or process name.
    • Process ID: The ID of the process to be killed.
    • Process Name: The name of the process to be killed.
  • Network Corruption: The user can corrupt incoming and outgoing network packets on the server(s). The default value is 10% and the maximum is 50% corruption.
    • Network Interface: The point of interconnection between a computer and a private or public network. The drop-down lists all the interfaces that have been added on the server.
    • Corruption Percentage: The percentage of network packets to be corrupted or manipulated.
  • Network Latency: Latency is the time a message takes to traverse a system; in a computer network, it is the time a packet of data takes to get from one designated point to another. As with corruption, the user can add latency to packets sent to and received from the server(s), either as a fixed value or as a range.
    • Network Interface: The network interface on which to apply the fault.
    • Specified: An integer value; packets are delayed by exactly this amount of time.
    • Random: Packets are delayed by a random time between the two given values.
  • Network Loss: As with Network Corruption and Network Latency, the user can apply loss to incoming and outgoing packets on the server(s). The default value is 10% and the maximum is 100% loss.
    • Network Interface: The network interface on which to apply the fault.
    • Corruption: The percentage of network packets to be corrupted or manipulated.
    • Loss %: The percentage of packets to be dropped.
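
The documented defaults and limits lend themselves to a small validation table. The sketch below checks a requested percentage against those limits and, for the network faults, builds the arguments of a Linux `tc qdisc add ... netem` command that would produce a comparable fault. Using `tc netem` is an assumption made for illustration only; this guide does not say how NetHavoc applies network faults. The names `LIMITS`, `validate_percent` and `netem_args`, the dictionary keys, and the millisecond unit for delays are all hypothetical.

```python
from typing import List, Optional, Tuple

# Documented (default %, maximum %) per fault type; keys are illustrative names.
LIMITS = {
    "high_cpu": (10, 80),
    "low_memory": (10, 80),
    "network_corruption": (10, 50),
    "network_loss": (10, 100),
}


def validate_percent(fault: str, value: Optional[float] = None,
                     current: float = 0.0) -> float:
    """Return the percentage to use, or raise ValueError if it breaks a documented limit."""
    default, maximum = LIMITS[fault]
    if value is None:
        value = default
    if not 0 <= value <= maximum:
        raise ValueError(f"{fault}: {value}% is outside 0-{maximum}%")
    # CPU Burn % must be greater than the CPU utilization already on the server.
    if fault == "high_cpu" and value <= current:
        raise ValueError(f"CPU burn {value}% must exceed current {current}%")
    return value


def netem_args(interface: str, corrupt: Optional[float] = None,
               loss: Optional[float] = None,
               delay_ms: Optional[Tuple[int, int]] = None) -> List[str]:
    """Build (but do not run) a `tc ... netem` command for a comparable fault."""
    args = ["tc", "qdisc", "add", "dev", interface, "root", "netem"]
    if delay_ms is not None:
        low, high = delay_ms
        if low < 0 or low > high:
            raise ValueError("delay range must satisfy 0 <= low <= high")
        if low == high:  # "Specified": one fixed delay
            args += ["delay", f"{low}ms"]
        else:            # "Random": netem varies the delay by +/- jitter around the midpoint
            mid, jitter = (low + high) / 2, (high - low) / 2
            args += ["delay", f"{mid:g}ms", f"{jitter:g}ms"]
    if corrupt is not None:
        args += ["corrupt", f"{validate_percent('network_corruption', corrupt):g}%"]
    if loss is not None:
        args += ["loss", f"{validate_percent('network_loss', loss):g}%"]
    return args
```

For example, `netem_args("eth0", delay_ms=(80, 120))` returns `['tc', 'qdisc', 'add', 'dev', 'eth0', 'root', 'netem', 'delay', '100ms', '20ms']`. Note that a netem qdisc attached as `root` only affects packets leaving the interface; shaping incoming packets on Linux typically needs extra setup, such as redirecting ingress traffic through an IFB device.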

Fault Injection Time

This specifies when the configured faults are injected. The following options are available:

  • Random: The user specifies a start date and time, an end date and time, and a duration. The system picks a random time between the specified start and end and injects the fault for the given duration.

  • Specified: The user specifies the exact date and time at which the configured fault is injected, along with its duration.

  • Current: The system captures the current date and time and injects the fault from that moment for the specified duration.
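
The three timing options can be summarized as one function that turns the user's inputs into a fault start and end time. This is a minimal sketch: `resolve_window` is a hypothetical helper, the mode strings are illustrative, and drawing the random start uniformly between the given start and end is an assumption (the guide only says "any random time"). The sketch constrains only the start time, not whether the fault must also finish before the window's end.

```python
import random
from datetime import datetime, timedelta
from typing import Optional, Tuple


def resolve_window(mode: str, duration: timedelta, *,
                   start: Optional[datetime] = None,
                   end: Optional[datetime] = None,
                   now: Optional[datetime] = None,
                   rng: Optional[random.Random] = None) -> Tuple[datetime, datetime]:
    """Return (fault_start, fault_end) for the Random, Specified or Current option."""
    if mode == "current":
        begin = now if now is not None else datetime.now()
    elif mode == "specified":
        if start is None:
            raise ValueError("specified mode needs a start time")
        begin = start
    elif mode == "random":
        if start is None or end is None or end <= start:
            raise ValueError("random mode needs start < end")
        # Assumption: the start point is drawn uniformly between start and end.
        span = (end - start).total_seconds()
        begin = start + timedelta(seconds=(rng or random.Random()).uniform(0, span))
    else:
        raise ValueError(f"unknown mode: {mode}")
    return begin, begin + duration
```

For instance, a Specified fault starting at 10:52 with a 5-minute duration resolves to the window 10:52-10:57.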

Fault Injection Summary

  1. Once the configuration is complete, click Add. This adds the following details to the Fault Injection Summary section:
  • Tier
  • Server
  • Connection Type
  • Fault type
  • Time Period
  • Start Date
  • Start Time
  • Duration
  • Status
  • Output

Operations

Once the configuration is complete, the user can perform the following operations:

  • Add: Adds the fault to the system. It is applied at the current time or at the specified start time, for the specified duration, once the user clicks Apply.
  • Delete: Deletes the fault from the system. A fault can be deleted only while its status is Ready to Apply; it cannot be deleted once its status is Scheduled or Running.
  • Update: Updates the fault in the system. A fault can be updated only while its status is Ready to Apply or Scheduled; once it reaches Running, it cannot be updated.
  • Apply: Applies the fault for the specified start time and duration.
  • Stop: Forcefully stops the fault when its status is Scheduled.
  • Refresh: Refreshes the NetHavoc UI. The changes are then reflected for all users of the same machine.
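
The status rules above, together with the status sequence shown in the examples below (Ready to Apply, Scheduled, Running, Completed), form a small state machine. The sketch models one summary row under stated assumptions: `FaultEntry`, `ALLOWED` and `advance` are hypothetical names, the "Stopped" status is invented because this guide does not name the status shown after Stop, and Stop is permitted only from Scheduled, the one status the Stop entry names.

```python
from enum import Enum


class Status(Enum):
    READY_TO_APPLY = "Ready to Apply"
    SCHEDULED = "Scheduled"
    RUNNING = "Running"
    COMPLETED = "Completed"
    STOPPED = "Stopped"  # assumption: the status shown after Stop is not documented


# Statuses from which each operation is permitted, per the Operations list.
ALLOWED = {
    "apply": {Status.READY_TO_APPLY},
    "delete": {Status.READY_TO_APPLY},
    "update": {Status.READY_TO_APPLY, Status.SCHEDULED},
    "stop": {Status.SCHEDULED},
}


class FaultEntry:
    """Hypothetical model of one row in the Fault Injection Summary."""

    def __init__(self) -> None:
        self.status = Status.READY_TO_APPLY  # status right after Add

    def can(self, op: str) -> bool:
        return self.status in ALLOWED[op]

    def _require(self, op: str) -> None:
        if not self.can(op):
            raise RuntimeError(f"cannot {op} a fault in status '{self.status.value}'")

    def apply(self) -> None:
        self._require("apply")
        self.status = Status.SCHEDULED

    def stop(self) -> None:
        self._require("stop")
        self.status = Status.STOPPED

    def advance(self, start_reached: bool, duration_elapsed: bool) -> None:
        """Move a scheduled fault forward as time passes."""
        if self.status is Status.SCHEDULED and start_reached:
            self.status = Status.RUNNING
        if self.status is Status.RUNNING and duration_elapsed:
            self.status = Status.COMPLETED
```

Keeping the permissions in one table makes it easy to check, for example, that a Running fault can no longer be updated or deleted.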

Examples

The following examples demonstrate injecting various faults and their impact as seen in the web dashboard:

High CPU Load

Specify the required details and select High CPU Load as the fault type. After the fault is added, its status is Ready to Apply.

When Apply is clicked, the status changes to Scheduled.

When the start time is reached, the status changes to Running.

When the fault has run for the specified duration, its status changes to Completed.

The user can view the impact of the injected fault in the web dashboard, where the deviation is visible for the specified time and duration. In this case, the start time was 10:52 and the duration was 5 minutes, so the end time was 10:57.

While the fault was injected, CPU utilization reached 80%, compared with a normal level of about 40% when no fault was injected.
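
The comparison the dashboard makes, metric values inside the fault window versus outside it, can be sketched in a few lines. This is an illustration only: `fault_window` and `compare` are hypothetical helpers, the window is treated as half-open (start inclusive, end exclusive), times are same-day HH:MM strings, and the sample values in the usage line are made up rather than taken from the dashboard.

```python
from datetime import datetime, timedelta
from statistics import mean
from typing import Dict, List, Tuple


def fault_window(start: str, minutes: int) -> Tuple[datetime, datetime]:
    """Start and end of a fault given an HH:MM start time and a duration in minutes."""
    begin = datetime.strptime(start, "%H:%M")
    return begin, begin + timedelta(minutes=minutes)


def compare(samples: List[Tuple[str, float]], start: str, minutes: int) -> Dict[str, float]:
    """Average a metric during the fault window and outside it (both must be non-empty)."""
    begin, end = fault_window(start, minutes)
    during: List[float] = []
    outside: List[float] = []
    for ts, value in samples:
        t = datetime.strptime(ts, "%H:%M")
        (during if begin <= t < end else outside).append(value)
    return {"during": mean(during), "baseline": mean(outside)}


print(fault_window("10:52", 5)[1].strftime("%H:%M"))  # 10:57
print(compare([("10:50", 40.0), ("10:53", 80.0), ("10:56", 78.0), ("10:58", 42.0)], "10:52", 5))
```

The same arithmetic gives the end times in the later examples, such as 11:30 plus 5 minutes and 16:26 plus 1 minute.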

Low Memory

Once the Low Memory fault is added and applied, its status changes to Running when the specified start time is reached.

After the fault has run for the specified duration, the status changes to Completed.

The user can view the impact of the injected fault in the web dashboard, where the deviation is visible for the specified time and duration. In this case, the start time was 11:30 and the duration was 5 minutes, so the end time was 11:35.

While the fault was injected, memory dropped to 260 GB, compared with about 320 GB when no fault was injected.

Application Termination

Here, an Application Termination fault is injected, which terminates an application on the server. After the fault is added, its status is Ready to Apply.

Once it is applied and has run at the specified start time for the specified duration, its status changes to Completed.

The user can view the impact of the injected fault in the web dashboard, where the deviation is visible for the specified time and duration. In this case, the start time was 16:26 and the duration was 1 minute, so the end time was 16:27.

The dashboard shows that the application was terminated at the specified time, 16:26.