Monitoring Linux System and Processes
using NetDiagnostics

Overview

Pre-production and production environments contain many Linux servers, each running many processes and services. These are application processes carrying out their tasks.
Servers often develop health issues because one or more processes consume excessive resources, which degrades the user experience. The root cause is usually hard to diagnose because the issue clears after some time and no data remains to analyze it.

NetDiagnostics has comprehensive monitoring, alerting, and dashboard capabilities. It collects all metrics of servers and key processes, stores them in its big data store, and retains them for several years.

Using this data, NetDiagnostics raises an alert when any of these metrics moves from the normal state to the warning state based on the configured thresholds, giving early warning before the metric reaches the critical state. Once an alert arrives, correlation helps identify which processes are causing the issue.
The dashboard shows system and process health, both the current state and the trend over the last N hours, days, or weeks.

System Monitoring

NetDiagnostics supports the following system-level metrics:

  • SysStats linux Extended – Information about processes, memory, paging, block IO, traps, and CPU activity.
  • System Load Stats – How long the system has been running, how many users are currently logged on, the system load averages for the past 1, 5, and 15 minutes, the open file hard/soft limits, Open Files, and Open Files (Pct).
  • Memory Stats Extended – Amount of free and used memory in the system.
  • File System stats – Amount of disk space available on the file system.
  • Device Stats – Device utilization stats per physical device or per partition.
  • TCP states counts – Number of TCP connections in each TCP state.
  • TCP Stats Rate – TCP protocol statistics such as the number of packets and ACKs sent and received, and the different errors.
  • Network Delay – Network delay (ms), percentage of packets lost, and the ratio of the maximum to the minimum round-trip time.
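As a rough illustration of where metrics such as load averages and TCP state counts come from on Linux, the sketch below parses text laid out like /proc/loadavg and /proc/net/tcp. It is not a description of how NetDiagnostics collects its data; the function names and the sample inputs are hypothetical.

```python
from collections import Counter

# TCP state codes as they appear (in hex) in the "st" column of /proc/net/tcp.
TCP_STATES = {
    "01": "ESTABLISHED", "02": "SYN_SENT", "03": "SYN_RECV",
    "04": "FIN_WAIT1", "05": "FIN_WAIT2", "06": "TIME_WAIT",
    "07": "CLOSE", "08": "CLOSE_WAIT", "09": "LAST_ACK",
    "0A": "LISTEN", "0B": "CLOSING",
}


def parse_loadavg(text):
    """Return the 1, 5 and 15 minute load averages from /proc/loadavg text."""
    one, five, fifteen = text.split()[:3]
    return float(one), float(five), float(fifteen)


def count_tcp_states(text):
    """Count connections per TCP state from /proc/net/tcp text."""
    counts = Counter()
    for line in text.strip().splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) > 3:
            counts[TCP_STATES.get(fields[3].upper(), "UNKNOWN")] += 1
    return dict(counts)


# Hypothetical sample data in the same layout as the real files.
SAMPLE_LOADAVG = "0.52 0.41 0.30 2/345 12345\n"
SAMPLE_TCP = """\
  sl  local_address rem_address   st tx_queue rx_queue
   0: 0100007F:0CEA 00000000:0000 0A 00000000:00000000
   1: 0100007F:1F90 0100007F:C350 01 00000000:00000000
   2: 0100007F:1F90 0100007F:C351 01 00000000:00000000
   3: 0100007F:C352 0100007F:1F90 06 00000000:00000000
"""

print(parse_loadavg(SAMPLE_LOADAVG))
print(count_tcp_states(SAMPLE_TCP))
```

On a real server, the same functions could be fed the contents of the actual files instead of the sample strings.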

System Metrics

The following snapshot shows the System Metrics tree and the top 5 servers by CPU Usage (%):

Service Monitoring

All services running on Linux servers can be monitored using a single configuration.
The following snapshot shows the Service Metrics tree and the top 5 services by CPU Usage (%):

Process Monitoring

NetDiagnostics supports the following process-level metrics:

  • Process IO Stats – Process IO read, IO write, IO write canceled, and IO delayed.
  • Process Stats – Process elapsed time, CPU time, memory used, shared memory size and open files of a process.
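To make the process IO counters more concrete, here is a minimal sketch that parses text in the layout of /proc/<pid>/io and turns two samples into per-second rates. It only illustrates the kind of data behind such metrics and assumes nothing about the NetDiagnostics implementation; the helper names and sample values are hypothetical.

```python
def parse_proc_io(text):
    """Parse /proc/<pid>/io text into a dict of integer counters."""
    counters = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            counters[key.strip()] = int(value)
    return counters


def io_rates(before, after, interval_sec):
    """Bytes per second between two counter samples taken interval_sec apart."""
    keys = ("read_bytes", "write_bytes", "cancelled_write_bytes")
    return {k: (after[k] - before[k]) / interval_sec for k in keys}


# Hypothetical samples of one process, taken 5 seconds apart.
SAMPLE_T0 = """\
rchar: 3000000
wchar: 5000000
syscr: 1200
syscw: 900
read_bytes: 1000000
write_bytes: 2000000
cancelled_write_bytes: 0
"""
SAMPLE_T1 = """\
rchar: 3600000
wchar: 7200000
syscr: 1400
syscw: 1300
read_bytes: 1500000
write_bytes: 4000000
cancelled_write_bytes: 0
"""

rates = io_rates(parse_proc_io(SAMPLE_T0), parse_proc_io(SAMPLE_T1), 5)
print(rates)
```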

Beyond the services covered by service monitoring, you can monitor additional processes by specifying search patterns for all key processes.
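The idea of selecting processes by search pattern can be sketched as follows. The pattern syntax that NetDiagnostics accepts is not described here, so this example simply assumes regular expressions matched against command lines; the function name, process list, and patterns are all hypothetical.

```python
import re


def match_processes(processes, patterns):
    """Return the (pid, cmdline) pairs whose command line matches any pattern."""
    compiled = [re.compile(p) for p in patterns]
    return [(pid, cmd) for pid, cmd in processes
            if any(rx.search(cmd) for rx in compiled)]


# Hypothetical process list and search patterns.
SAMPLE_PROCESSES = [
    (812, "/usr/sbin/sshd -D"),
    (1450, "java -jar /opt/app/orders.jar"),
    (1502, "nginx: worker process"),
    (2210, "python3 /opt/app/report.py"),
]
PATTERNS = [r"java .*orders", r"^nginx"]

for pid, cmd in match_processes(SAMPLE_PROCESSES, PATTERNS):
    print(pid, cmd)
```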
The following snapshot shows the Process Metrics tree and the top 5 processes by CPU Usage (%):

Linux System and Process Monitoring Concepts

A program loaded into the memory of a Linux computer becomes a process. Processes need to be managed and monitored because they consume system resources like CPU time, memory and disk space. There are also security and safety implications. Monitoring and managing processes is, therefore, an important function of DevOps.

When it comes to process monitoring for Unix systems, you have multiple options using different tools:

The ‘top’ program is a very powerful utility that provides a great deal of information about your running system. This includes data about memory usage, CPU loads, and a list of running processes including the amount of CPU time and memory being utilized by each process. ‘Top’ displays system information in near real-time, updating (by default) every three seconds.
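A CPU percentage of this kind is typically derived from the difference between two readings of cumulative counters. The sketch below shows one common way to do that with the aggregate "cpu" line of /proc/stat; it is an illustration under that assumption, not the exact calculation top performs, and the function names and sample lines are hypothetical.

```python
def parse_cpu_line(line):
    """Return (busy, total) jiffies from the aggregate 'cpu' line of /proc/stat."""
    # Fields after the label: user nice system idle iowait irq softirq steal ...
    fields = [int(v) for v in line.split()[1:9]]
    idle = fields[3] + fields[4]  # idle + iowait
    total = sum(fields)
    return total - idle, total


def cpu_percent(prev_line, curr_line):
    """CPU utilization (%) between two samples of the 'cpu' line."""
    prev_busy, prev_total = parse_cpu_line(prev_line)
    curr_busy, curr_total = parse_cpu_line(curr_line)
    delta_total = curr_total - prev_total
    if delta_total <= 0:
        return 0.0
    return 100.0 * (curr_busy - prev_busy) / delta_total


# Hypothetical samples taken a few seconds apart.
PREV = "cpu  1000 0 500 8000 500 0 0 0 0 0"
CURR = "cpu  1120 0 530 8140 510 0 0 0 0 0"

print(cpu_percent(PREV, CURR))
```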

A sample output from the top program is shown in the figure. The output is divided into two sections: the “summary” section at the top of the output and the “process” section in the lower portion.

The top command is already fairly readable, but other commands make the output even more readable: atop, htop, and glances.

Htop is an advanced Linux process monitoring tool that is similar to top but offers richer features such as an interactive process viewer, vertical and horizontal scrolling, and shortcut keys. It is a third-party tool that usually does not come pre-installed on Linux or Unix systems, so you need to download and install it.

With these commands, anyone can check which process is causing a performance issue on the system. But how do we get that information in the following cases?

• The system is overloaded.
• There is no remote access and the machine goes down suddenly due to load or any other reason.

In these situations, we want to go back in time to check which processes caused the performance issue. Running the top command is not a good option, because it is too late once the server is already overloaded.

With the NetDiagnostics monitoring dashboard, one can effectively go back in time and see which process was causing the issue.
NetDiagnostics can monitor the top N services and specified processes based on search criteria. The dashboard can show historical data for a selected period.
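The value of retained history can be illustrated with a small sketch: if per-process samples are stored with timestamps, the top consumers for any past window can be recomputed later. This is only a toy in-memory model, not the NetDiagnostics storage design; the class name, method names, and sample data are hypothetical.

```python
from collections import defaultdict


class ProcessHistory:
    """Keeps timestamped per-process CPU samples for later look-back queries."""

    def __init__(self):
        self._samples = []  # list of (timestamp, {process_name: cpu_percent})

    def record(self, timestamp, cpu_by_process):
        self._samples.append((timestamp, dict(cpu_by_process)))

    def top_n(self, start, end, n=5):
        """Top n processes by average CPU over samples with start <= ts <= end."""
        totals = defaultdict(float)
        counts = defaultdict(int)
        for ts, sample in self._samples:
            if start <= ts <= end:
                for name, cpu in sample.items():
                    totals[name] += cpu
                    counts[name] += 1
        averages = {name: totals[name] / counts[name] for name in totals}
        return sorted(averages.items(), key=lambda kv: kv[1], reverse=True)[:n]


# Hypothetical samples, one per minute (timestamps in seconds).
history = ProcessHistory()
history.record(0, {"java": 20.0, "nginx": 5.0})
history.record(60, {"java": 90.0, "nginx": 6.0, "backup.sh": 40.0})
history.record(120, {"java": 94.0, "nginx": 4.0, "backup.sh": 60.0})
history.record(180, {"java": 25.0, "nginx": 5.0})

# Look back at the window in which the server was overloaded.
print(history.top_n(60, 120, n=2))
```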

Dashboard

In NDE, there are pre-built dashboards, and you can also create your own. The following example shows system and process health:

A. The first row displays system performance: Current CPU Utilization, Current Memory Utilization, Avg of CPU Utilization, Load Avg of the last 1 min, and CPU I/O Wait.
B. The second row displays CPU Usage, Memory Usage, and Thread count for multiple processes.
C. The third row displays process IO stats and the CPU usage of the top five services. High IO by a process can make the system unhealthy.

Alerting

NetDiagnostics has built-in support for proactive alerting, through which users are notified whenever a Key Performance Indicator (KPI) metric such as CPU utilization, requests per second, or average response time breaches its threshold or deviates from baseline data. This gives users early notification of performance degradation before an actual issue occurs.
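How a baseline is computed in NetDiagnostics is not described here. As one simple possibility, the sketch below treats the mean of recent values as the baseline and flags a value that lies more than k standard deviations away from it. The function name, the choice of k, and the sample values are assumptions made for illustration.

```python
from statistics import mean, pstdev


def deviates_from_baseline(history, current, k=3.0):
    """True if current lies more than k standard deviations from the baseline mean."""
    baseline = mean(history)
    spread = pstdev(history)
    if spread == 0:
        return current != baseline
    return abs(current - baseline) > k * spread


# Hypothetical CPU utilization (%) readings forming the baseline.
BASELINE_CPU = [40, 42, 38, 41, 39]

print(deviates_from_baseline(BASELINE_CPU, 41))
print(deviates_from_baseline(BASELINE_CPU, 75))
```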

Alert Rules

An alert rule is the key element in which the user defines the metric and the associated thresholds that generate an alert of a specific severity, such as critical, major, or minor.
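Conceptually, such a rule maps a metric value to a severity. The following sketch models that mapping with one threshold per severity; it is not the NetDiagnostics rule format, and the class name, field names, and threshold values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class AlertRule:
    metric: str
    minor: float
    major: float
    critical: float

    def evaluate(self, value):
        """Return the highest severity whose threshold the value meets, or 'normal'."""
        if value >= self.critical:
            return "critical"
        if value >= self.major:
            return "major"
        if value >= self.minor:
            return "minor"
        return "normal"


# Hypothetical thresholds for a CPU utilization rule (values in percent).
rule = AlertRule(metric="CPU Utilization", minor=70.0, major=80.0, critical=90.0)

for value in (65.0, 75.0, 85.0, 95.0):
    print(value, rule.evaluate(value))
```

Checking the thresholds from most to least severe ensures that a value above several thresholds is reported only once, at its highest severity.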

Here we configure the rule for CPU Utilization of the NetDiagnostics server.

Alert History

Alert history provides insight into previously generated alerts. It also helps in understanding how the severity of specific alerts has changed over time.
Here, we show a critical alert for CPU Utilization.

NetDiagnostics Monitoring Architecture

NetDiagnostics has extensive monitoring capabilities. The following is the high-level architecture.

You need to install the Cavisson Agent on all servers; it collects all requested metrics and sends them to the NetDiagnostics Server.