NetCloud Troubleshooting

Unable to log in to NetCloud machine from CLI

Possible Reasons #1: Machine IP is down
1) If machine is down: Screen1
2) If machine is up: Screen2

Steps to Diagnose: Ping the machine IP from a terminal using the ping command.
Command Used: ping 1.2.3.4, where 1.2.3.4 is the machine IP.

Solution: Contact the machine owner.
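The ping check above can be wrapped in a small shell sketch (1.2.3.4 is the same placeholder IP used above; substitute the real machine IP):

```shell
#!/bin/sh
# Return success if the machine answers ping within a short timeout.
is_up() {
    ping -c 2 -W 2 "$1" > /dev/null 2>&1
}

# 1.2.3.4 is a placeholder -- substitute the real machine IP.
if is_up 1.2.3.4; then
    echo "machine is up -- check the SSH/login service instead"
else
    echo "machine is down -- contact the machine owner"
fi
```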


Unable to start NetCloud test; getting an error

Possible Reasons #1: License error. Screen3
Steps to Diagnose: Check whether the license is valid using the command nsu_show_license -l. The output will show either:
1) the license file is not present, or
2) the license is invalid or expired.
Command Used: nsu_show_license -l
Solution: Contact the Cavisson client support team for a new license.

Possible Reasons #2: cmon may not be running on the controller and generators. Screen4
Steps to Diagnose: Check whether cmon is running using the ps command.
Command Used: ps -ef | grep cmon
Solution: Start/restart cmon using the command

/etc/init.d/cmon start
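The check-and-restart flow above can be sketched as a short script. The restart command /etc/init.d/cmon start comes from the solution above; adapt it if your system manages cmon differently:

```shell
#!/bin/sh
# Report whether a named process is running (ps -ef based, as in the diagnosis above).
check_service() {
    if ps -ef | grep -v grep | grep "$1" > /dev/null; then
        echo running
    else
        echo stopped
    fi
}

if [ "$(check_service cmon)" = stopped ]; then
    echo "cmon is not running -- restart with: /etc/init.d/cmon start"
else
    echo "cmon is running"
fi
```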

Possible Reasons #3: Generator file is missing on the controller. Screen5
Steps to Diagnose: Check file availability in the /home/cavisson/etc/,netcloud directory from the CLI.
Command Used: ls -ltra /home/cavisson/etc/,netcloud
Solution: Add generators from the UI. This option is available at
scenarios>>add generator>>generator file UI.

Possible Reasons #4: Generator information is missing in the generator file. Screen6
Steps to Diagnose: Check the InitScreen UI. It will show an error that generator information is missing.
Solution: Add generators from the UI. This option is available at scenarios>>add generator>>generator file UI.

Possible Reasons #5: A wrong keyword is used in the scenario. Screen7
Steps to Diagnose: Check the InitScreen UI. It will show an error about the wrong/missing keyword used.
Command Used: vi scenarioName.conf
Solution: Correct the keyword from the scenario UI.

Possible Reasons #6: Script is not compiled. Screen8
Steps to Diagnose: InitScreen will show an error regarding the script.
Solution: Correct the script from the Script Manager.

Possible Reasons #7: PostgreSQL service is not running on the controller. Screen9
Steps to Diagnose: Check using the ps command.
Command Used: ps -ef | grep postgresql
Solution: Start the PostgreSQL service.


Users went down

Possible Reasons #1:
1) Test stopped on a few/all generators.
2) Some CVMs got killed on generators.

Screen12

Steps to Diagnose

Check the logs below:

1) $NS_WDIR/logs/TRXX/partition/ns_logs/ns_trace.log

2) $NS_WDIR/logs/TRXX/TestRunOutput.log

The test stopped or CVMs were killed due to a core dump in a code function or in the system kernel.
Check dmesg -T for segfaults. Also check for core files at /home/cavisson/core_files.
Get a backtrace using gdb and analyse the frames where the dump was created.

Command Used: vi

1) $NS_WDIR/logs/TRXX/partition/ns_logs/ns_trace.log

2) $NS_WDIR/logs/TRXX/TestRunOutput.log

Solution: Contact the Cavisson product team for a code fix.
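The dmesg and core-file checks above can be combined into a quick triage pass. /home/cavisson/core_files is the core directory named above; the gdb invocation shown in the comment is the usual pattern for loading a core alongside the binary that produced it:

```shell
#!/bin/sh
# Quick core-dump triage: recent segfault lines plus the newest core files.
core_triage() {
    echo "--- recent segfaults (dmesg) ---"
    dmesg -T 2>/dev/null | grep -i segfault | tail -5
    echo "--- newest core files in $1 ---"
    ls -lt "$1" 2>/dev/null | head -5
    # To analyse a core, load it in gdb with the binary that produced it:
    #   gdb /path/to/binary "$1"/core.<pid>
    #   (gdb) bt   # backtrace of the frames where the dump was created
    echo "triage done"
}

core_triage /home/cavisson/core_files
```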


Generators got discarded

Possible Reasons #1: Generator is busy and got killed due to a delay in its progress report. Screen13
Steps to Diagnose

1) Check ns_trace.log, present at $NS_WDIR/logs/TRxx/partition/ns_logs/ns_trace.log.

2) Search for the error message “Did not get progress report for 300000 msecs” in ns_trace.log.

3) Check the netstat logs of all the generators and the controller.
>> The controller netstat log is at $NS_WDIR/logs/TRxx.
>> The generator netstat logs are at $NS_WDIR/logs/TRxx/NetCloud/generator_name/TRxx/netstat.txt.

4) From the above logs you can check where data is stuck: in the receive queue or the send queue.

5) In this case all NVMs will be busy up to 99%.

Command Used: Use the keyword as below:
NUM_NVM 2 MACHINE
This will generate a total of 4 NVMs. Note: the value 2 is an example.
Solution: Provide a sufficient number of NVMs in the test to sustain the load.
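A quick way to confirm this case is to grep ns_trace.log for the exact timeout message quoted above (TRxx stays a placeholder for the actual test-run number):

```shell
#!/bin/sh
# Print lines where a generator missed its progress-report deadline.
find_progress_timeouts() {
    grep -n "Did not get progress report" "$1" 2>/dev/null
}

# TRxx is a placeholder -- substitute the real test-run directory.
find_progress_timeouts "$NS_WDIR/logs/TRxx/partition/ns_logs/ns_trace.log" \
    || echo "no progress-report timeouts found (or log not present)"
```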

Possible Reasons #2: Bandwidth is fully utilised.
Screen14
Steps to Diagnose: Check the Received Throughput graph from the Dashboard.

This graph is available at Test Metrics >> Https request >> TCP receive throughput.

Command Used:
1) Check for samples stuck in the send queue using the command netstat -natp.
2) Check bandwidth utilisation using a tool such as nload, iftop, or iperf.
Solution: Reduce the load on that particular generator.
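The netstat check above can be narrowed to just the suspicious connections with a short awk filter. This is a sketch that assumes the standard `netstat -natp` column layout (Recv-Q in column 2, Send-Q in column 3, after two header lines):

```shell
#!/bin/sh
# Print Send-Q, local address, and remote address for connections
# that have data stuck in the send queue (non-zero Send-Q).
stuck_sendq() {
    awk 'NR > 2 && $3 + 0 > 0 { print $3, $4, $5 }'
}

netstat -natp 2>/dev/null | stuck_sendq
```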

Possible Reasons #3: Controller doesn’t send an acknowledgement message to the generator.
Screen15
Steps to Diagnose:
1) Check controller system health, such as load average (troubleshooting a high load average is covered in the “Load Average is High” section).
2) Check in ns_trace.log whether the controller sent an acknowledgement to the generator; the path of this log is given above.
Command Used: top
Solution: Make the controller health stable.

Possible Reasons #4: Old or bad kernel on the generator machine.
Steps to Diagnose: Check the kernel on the generator using the Linux command uname -r.
Command Used: uname -r
Solution: Upgrade to the latest kernel.
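A kernel-version check like the one above can be automated with sort -V. The 4.15 below is only an example threshold, not a NetCloud requirement; substitute the minimum version your product actually needs:

```shell
#!/bin/sh
# kernel_at_least MIN: succeed if the running kernel version is >= MIN.
kernel_at_least() {
    min="$1"
    cur=$(uname -r | cut -d- -f1)   # strip distro suffix, e.g. 5.15.0-91 -> 5.15.0
    # sort -V orders version strings; the minimum must sort first (or be equal)
    [ "$(printf '%s\n%s\n' "$min" "$cur" | sort -V | head -1)" = "$min" ]
}

# 4.15 is an example threshold only.
if kernel_at_least 4.15; then
    echo "kernel ok: $(uname -r)"
else
    echo "kernel too old: $(uname -r) -- upgrade"
fi
```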

Possible Reasons #5: NVMs of a generator are stuck.
Screen16
Steps to Diagnose

1) This happens when an NVM gets stuck because resources are blocked, for example when disk I/O or CPU utilisation is high on the generator machine. The NVM then cannot make progress and sample generation is delayed.

2) Check the scripts used in the test. There may be a loop in which the NVMs are stuck.

Command Used: $NS_WDIR/scripts/project/subProject/script_name
Solution: Correct the script.


Getting 100% failure on generators

Possible Reasons #1: Generator IPs are not whitelisted at the application end.
Screen26
Steps to Diagnose: Check the host using the ping command or wget.
Command Used:
1) ping hostname

2) wget hostname

Solution: Get the generator IPs whitelisted at the application end.


Not getting page dump report

Possible Reasons #1: G_TRACING keyword is not enabled in the scenario.
Steps to Diagnose: Check the scenario.
Command Used: Check the KeywordDefination.dat file at $NS_WDIR/etc.
Solution: Use the right keyword in the scenario.


CPU utilization is high

Possible Reasons #1: System CPUs are occupied by processes and unavailable to handle other requests.
Screen10
Steps to Diagnose: Use the top command to check which processes are consuming the most CPU.
Go through the link below for more debugging:
https://bobcares.com/blog/high-cpu-utilization/
Commands to validate: top
Solution: Fix the processes that are taking more CPU:
1) Fix at the configuration level.
2) Stop the process if it is not needed.


Load Average is High

Possible Reasons #1: System is overloaded and many processes are waiting for system resources. Screen11
Steps to Diagnose: Use the top command to check which processes are taking more system resources (CPU, RAM, disk, etc.).
Go through the link below for more debugging:
https://martincarstenbach.wordpress.com/2013/06/25/troubleshooting-high-load-average-on-linux/
Commands to validate: top
Solution: Fix the processes taking more system resources.
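As a rough rule, compare the 1-minute load average against the number of CPUs: a sustained ratio well above 1.0 per core means processes are queueing for resources. A minimal sketch (Linux-specific, reading /proc/loadavg):

```shell
#!/bin/sh
# Print the 1-minute load average divided by the number of CPUs.
load_per_core() {
    load=$(cut -d' ' -f1 /proc/loadavg)
    cores=$(nproc)
    awk -v l="$load" -v c="$cores" 'BEGIN { printf "%.2f\n", l / c }'
}

echo "load per core: $(load_per_core)"
```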


NetCloud test stuck on database creation

Possible Reasons #1: Database is busy with some other task.
Screen17
Steps to Diagnose: Check whether any process is running against the database, or whether any uploading or downloading is happening in the DB.
Commands to validate:
Solution:

Possible Reasons #2: nsu_db_upload processes from older test runs (no longer running) are still active.

Possible Reasons #3: Older nia_file_aggrigator processes are still running.

Steps to Diagnose:
1) Check using ps -ef | grep nsu_db_upload.
2) Check whether any test is running with the corresponding process, using nsu_show_all_netstorm.

Commands to validate:
1) ps -ef | grep nsu_db_upload

2) nsu_show_all_netstorm

3) kill -9 pid

Solution: Stop these older processes by killing them.
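The three validation commands above can be combined into a sketch that lists candidate PIDs first, so each one can be confirmed (e.g. against nsu_show_all_netstorm) before being killed:

```shell
#!/bin/sh
# List PIDs of processes whose command line matches the given pattern.
stale_pids() {
    ps -ef | grep -v grep | grep -E "$1" | awk '{ print $2 }'
}

# Process names taken from the text above; review each PID before killing it.
for pid in $(stale_pids 'nsu_db_upload|nia_file_aggrigator'); do
    echo "candidate for kill -9: $pid"
done
```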


NetCloud test fails in the middle of the run

Possible Reason #1: A core dump on the controller, caused by a fault in the code or by the system kernel.

Possible Reason #2: NVM failures with core dumps on the failed generators.

Steps to Diagnose

Check the logs below:

1) $NS_WDIR/logs/TRXX/partition/ns_logs/ns_trace.log

2) $NS_WDIR/logs/TRXX/TestRunOutput.log

The test stopped or CVMs were killed due to a core dump in a code function or in the system kernel.
Check dmesg -T for segfaults. Also check for core files at /home/cavisson/core_files.
Get a backtrace using gdb and analyse the frames where the dump was created.

Commands to validate: gdb
Solution: Contact the Cavisson client support team.

Possible Reasons #3: Not enough space left on the controller/generators.
Steps to Diagnose:
1) Check using df -h.
2) It can also be checked using the nsu_server_admin command.

Commands to validate:
1) df -h

2) nsu_server_admin -s ip -c “df -h”

Solution: Free up disk space on the affected machine.

Possible Reasons #4: Someone has stopped the test forcefully. Screen18
Steps to Diagnose: Ping the generator IP.
Commands to validate: ping ip
Solution: Contact the CS team.

Possible Reasons #5: Generator went down. Screen19
Steps to Diagnose: Check nsu_stop_test.log, present at $NS_WDIR/logs/TRxx.
Commands to validate: vi $NS_WDIR/logs/TRxx/nsu_stop_test.log
Solution: Restart the test if required.


Not able to start test due to shared memory issue

Possible Reasons #1: The shared-memory limit on the NS/generator system is too low.
Screen20
Steps to Diagnose:
1) Run the command cat /proc/sys/kernel/shmmax.
2) The value must be greater than the buffer requested in the script.
3) On a Cavisson cloud machine it is approximately 20 GB.

Commands to validate: cat /proc/sys/kernel/shmmax
Solution: Check and, if needed, raise this value; changing it requires root access.
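The shmmax comparison can be scripted as below. awk is used for the comparison because the kernel's default shmmax can exceed the shell's signed-integer range; the 20 GB figure is the approximate value cited above for Cavisson cloud machines:

```shell
#!/bin/sh
# Compare kernel.shmmax against a required size in bytes.
shmmax_bytes() {
    cat /proc/sys/kernel/shmmax
}

required=$((20 * 1024 * 1024 * 1024))   # ~20 GB, the figure cited above
if awk -v v="$(shmmax_bytes)" -v r="$required" 'BEGIN { exit (v + 0 >= r) ? 0 : 1 }'; then
    echo "shmmax ok: $(shmmax_bytes)"
else
    echo "shmmax too small -- raise it as root, e.g.: sysctl -w kernel.shmmax=$required"
fi
```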


Unable to start test due to unknown host error

Possible Reasons #1: DNS nameserver entry is missing in the resolver file. Screen25
Steps to Diagnose:
1) Check the file: cat /var/run/dnsmasq/resolv.conf
2) The entry will look like: nameserver 8.8.8.8
3) If the entry is not there, add it manually.

Commands to validate: cat /var/run/dnsmasq/resolv.conf
Solution: Keep a nameserver entry in the resolv.conf file.
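The check above can be sketched as follows; /var/run/dnsmasq/resolv.conf and the 8.8.8.8 example entry are taken from the steps above:

```shell
#!/bin/sh
# Succeed if the given resolver file contains at least one nameserver entry.
has_nameserver() {
    grep -q '^nameserver' "$1" 2>/dev/null
}

if has_nameserver /var/run/dnsmasq/resolv.conf; then
    echo "nameserver entry present"
else
    echo "nameserver entry missing -- add a line such as: nameserver 8.8.8.8"
fi
```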

Possible Reasons #2: Host is not reachable from the source IP.
Steps to Diagnose: Check the host using the ping command or wget.
Commands to validate:
1) ping hostname

2) wget hostname

Solution: Get the source IP whitelisted at the host application.