NetCloud Troubleshooting

Unable to log in to the NetCloud machine from the CLI

Possible Reasons #1 Machine IP is down
1) If the machine is down (Screen 1)
2) If the machine is up (Screen 2)

Steps to Diagnose Ping the machine IP from a terminal using the ping command.
Command Used ping 1.2.3.4, where 1.2.3.4 is the machine IP.
Solution Contact the machine owner.
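The ping check above can be scripted as a small sketch; 1.2.3.4 is the placeholder IP from this guide, so substitute the real machine IP.

```shell
#!/bin/sh
# Report whether a machine answers ping; "up"/"down" mirrors Screens 1 and 2.
machine_status() {
    # -c 2: send two probes; -W 2: wait at most 2 seconds per reply
    if ping -c 2 -W 2 "$1" > /dev/null 2>&1; then
        echo "up"
    else
        echo "down"
    fi
}

MACHINE_IP=1.2.3.4   # placeholder IP from this guide
echo "$MACHINE_IP is $(machine_status "$MACHINE_IP")"
```

If the machine reports down, contact the machine owner as above.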


Unable to start a NetCloud test; getting an error

Possible Reasons #1 License error (Screen 3)
Steps to Diagnose Check whether the license is valid using the command nsu_show_license -l. It may report that: 1) the license file is not present, or 2) the license is invalid/expired.
Command Used nsu_show_license -l
Solution Contact the Cavisson client support team for a new license.
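As a sketch, the license check can be wrapped in a small helper. The exact wording of nsu_show_license messages is an assumption here; adjust the patterns to match your build's output.

```shell
#!/bin/sh
# Classify nsu_show_license output; the matched phrases are assumptions.
classify_license() {
    case "$1" in
        *[Ee]xpired*|*[Ii]nvalid*|*"not present"*) echo "license problem" ;;
        *) echo "license looks ok" ;;
    esac
}

# only attempt the real check when the tool is installed
if command -v nsu_show_license > /dev/null 2>&1; then
    classify_license "$(nsu_show_license -l 2>&1)"
fi
```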

Possible Reasons #2 cmon may not be running on the controller and generators (Screen 4)
Steps to Diagnose Check whether cmon is running using the ps command.
Command Used ps -ef | grep cmon
Solution Start/restart cmon using the command

/etc/init.d/cmon start
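The ps check and restart can be combined in a short sketch:

```shell
#!/bin/sh
# Check for a process by name; grep -v grep drops our own grep from the listing.
is_running() {
    ps -ef | grep -v grep | grep -q "$1"
}

if is_running cmon; then
    echo "cmon is running"
else
    echo "cmon is not running"
    # start it (needs root): /etc/init.d/cmon start
fi
```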

Possible Reasons #3 A generator file is missing on the controller (Screen 5)
Steps to Diagnose Check file availability in the /home/cavisson/etc/,netcloud directory from the CLI.
Command Used ls -ltra /home/cavisson/etc/,netcloud
Solution Add the generators from the UI. This option is available at
scenarios >> add generator >> generator file.

Possible Reasons #4 Generator information is missing in the generator file (Screen 6)
Steps to Diagnose Check the InitScreen UI. It will show an error stating that generator information is missing.
Solution Add the generators from the UI. This option is available at scenarios >> add generator >> generator file.

Possible Reasons #5 A wrong keyword is used in the scenario (Screen 7)
Steps to Diagnose Check the InitScreen UI. It will show an error about the wrong/missing keyword.
Command Used vi scenarioName.conf
Solution Correct the keyword from the scenario UI.

Possible Reasons #6 The script is not compiled (Screen 8)
Steps to Diagnose The InitScreen will show an error regarding the script.
Solution Correct the script from the Script Manager.

Possible Reasons #7 The PostgreSQL service is not running on the controller (Screen 9)
Steps to Diagnose Check using the ps command.
Command Used ps -ef | grep postgres
Solution Start the PostgreSQL service.
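The same kind of ps check works for PostgreSQL; the start commands in the comments are common defaults and may differ on your distribution.

```shell
#!/bin/sh
# True when any process matching the pattern appears in the ps listing.
proc_matches() {
    # pass patterns like '[p]ostgres' so the grep itself never matches
    ps -ef | grep "$1" > /dev/null 2>&1
}

if proc_matches '[p]ostgres'; then
    echo "postgresql is running"
else
    echo "postgresql is not running"
    # start it (distribution-dependent, needs root), e.g.:
    #   service postgresql start
    #   systemctl start postgresql
fi
```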


Users went down

Possible Reasons #1
1) The test stopped on a few/all generators.

2) Some NVMs got killed on the generators.

(Screen 12)

Steps to Diagnose

Check the logs below:

1) $NS_WDIR/logs/TRXX/partition/ns_logs/ns_trace.log

2) $NS_WDIR/logs/TRXX/TestRunOutput.log

The test may stop, or NVMs may be killed, due to a core dump in a code function or the system kernel.
Check dmesg -T for segfaults. Also check for core files at /home/cavisson/core_files.
Get a backtrace using gdb and analyse the frames where the dump was created.

Command Used vi

1) $NS_WDIR/logs/TRXX/partition/ns_logs/ns_trace.log

2) $NS_WDIR/logs/TRXX/TestRunOutput.log

Solution Contact the Cavisson product team for a code fix.
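The log checks above can be sketched as a quick scan. The keywords below are common crash markers, not an exhaustive list, and TRXX is a placeholder run number.

```shell
#!/bin/sh
# Print lines that hint at a crash (segfault / core dump / killed process).
scan_for_crash() {
    echo "$1" | grep -iE 'segfault|core dump|killed'
}

LOG_DIR="$NS_WDIR/logs/TRXX"   # TRXX: placeholder test-run number
scan_for_crash "$(cat "$LOG_DIR/partition/ns_logs/ns_trace.log" \
                      "$LOG_DIR/TestRunOutput.log" 2>/dev/null;
                  dmesg -T 2>/dev/null)" || echo "no crash markers found"
```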


Generators got discarded

Possible Reasons #1 The generator is busy and got killed due to a delay in its progress report (Screen 13)
Steps to Diagnose

1) Check ns_trace.log, present at $NS_WDIR/logs/TRxx/partition/ns_logs/ns_trace.log.

2) Search for the error message "Did not get progress report for 300000 msecs" in ns_trace.log.

3) Check the netstat logs of all the generators and the controller.
>> The controller netstat log is at $NS_WDIR/logs/TRxx.
>> Generator netstat logs are at $NS_WDIR/logs/TRxx/NetCloud/generator_name/TRxx/netstat.txt.

4) From the above logs you can check where data is stuck: in the receive queue or the send queue.

5) In this case all NVMs will be busy, up to 99%.

Command Used Use the keyword as below:
NUM_NVM 2 MACHINE
This will generate a total of 4 NVMs. Note: the value 2 is an example.
Solution Provide a sufficient number of NVMs in the test to sustain the load.

Possible Reasons #2 Bandwidth is fully utilised (Screen 14)
Steps to Diagnose Check the Received Throughput graph from the Dashboard.

This graph is available at Test Metrics >> Https request >> TCP receive throughput.

Command Used
1) Check for stuck samples in the send queue using the command netstat -natp.
2) Check bandwidth utilisation using tools such as nload, iftop, or iperf.
Solution Reduce the load on that particular generator.
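The netstat check can be sketched as below; on Linux, column 3 of `netstat -natp` output is the Send-Q (bytes queued but not yet acknowledged by the peer).

```shell
#!/bin/sh
# Flag connections with data stuck in the send queue.
flag_stuck_sendq() {
    # skip the two netstat header lines; $3 = Send-Q, $4 = local address
    awk 'NR > 2 && $3 > 0 { print $4, "Send-Q:", $3 }'
}

netstat -natp 2>/dev/null | flag_stuck_sendq
```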

Possible Reasons #3 The controller does not send an acknowledgement message to the generator (Screen 15)
Steps to Diagnose
1) Check controller system health, e.g. the load average; troubleshooting a high load average is covered in the "Load Average is High" section of this guide.
2) Check in ns_trace.log whether the controller sent an acknowledgement to the generator; the path of this log is mentioned above.
Command Used top
Solution Make the controller health stable.

Possible Reasons #4 Old or bad kernel on the generator machine
Steps to Diagnose Check the kernel on the generator using the Linux command uname -r.
Command Used uname -r
Solution Upgrade to the latest kernel.

Possible Reasons #5 NVMs of the generator are stuck
(Screen 16)
Steps to Diagnose

1) This happens when an NVM gets stuck because resources are blocked, for example when disk I/O or CPU utilisation on the generator machine is high. The NVM then cannot process, and a delay appears in sample generation.

2) Check the scripts used in the test. There may be a loop in which the NVMs are stuck.

Command Used $NS_WDIR/scripts/project/subProject/script_name
Solution Correct the script.


Getting 100% failure on generators

Possible Reasons #1 Generator IPs are not whitelisted at the application end (Screen 26)
Steps to Diagnose Check the host using the ping command or wget.
Command Used
1) ping hostname

2) wget hostname

Solution Get the generator IPs whitelisted at the application end.

Possible Reasons #2 Not getting a page dump report because the G_TRACING keyword is not enabled in the scenario
Steps to Diagnose Check the scenario.
Command Used Check the KeywordDefination.dat file at $NS_WDIR/etc
Solution Use the right keyword in the scenario.


CPU utilization is high

Possible Reasons #1 System CPUs are occupied by running processes and are unavailable for other requests (Screen 10)
Steps to Diagnose Use the top command to check which processes are consuming the most CPU.
Go through the link below for more debugging:
https://bobcares.com/blog/high-cpu-utilization/
Commands to validate top
Solution Fix the resources that are taking more CPU:
1) Apply fixes at the configuration level.
2) Stop the process if it is not needed.


Load Average is High

Possible Reasons #1 The system is overloaded and many processes are waiting for system resources (Screen 11)
Steps to Diagnose Use the top command to check which processes are taking more system resources (CPU, RAM, disk, etc.).
Go through the link below for more debugging:
https://martincarstenbach.wordpress.com/2013/06/25/troubleshooting-high-load-average-on-linux/
Commands to validate top
Solution Fix the processes taking more system resources.


NetCloud test stuck on database creation

Possible Reasons #1 The database is busy with some other task (Screen 17)
Steps to Diagnose 1) Check whether any process is running against the database, or whether any uploading or downloading is happening in the DB.
Commands to validate
Solution

Possible Reasons #2 nsu_db_upload processes of older test runs, which are not currently running, are still active.

Possible Reasons #3 Older nia_file_aggrigator processes are still running.

Steps to Diagnose 1) Check using ps -ef | grep nsu_db_upload.

2) Check whether any test is running with the corresponding process using nsu_show_all_netstorm.

Commands to validate 1) ps -ef | grep nsu_db_upload

2) nsu_show_all_netstorm

3) kill -9 pid

Solution Stop these older processes by killing them.
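The cleanup above can be sketched as follows; the kill is left commented out so it runs only after confirming, via nsu_show_all_netstorm, that no live test owns the processes.

```shell
#!/bin/sh
# List PIDs whose command line matches $1 (excluding our own grep).
stale_pids() {
    ps -ef | grep -v grep | grep "$1" | awk '{ print $2 }'
}

PIDS=$(stale_pids nsu_db_upload)
if [ -n "$PIDS" ]; then
    echo "stale nsu_db_upload PIDs: $PIDS"
    # after confirming with nsu_show_all_netstorm that no live test owns them:
    # kill -9 $PIDS
else
    echo "no stale nsu_db_upload processes"
fi
```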


NetCloud test fails in the middle of the test

Possible Reason #1 A core dump on the controller, due to a fault in the code or in the system kernel.

Possible Reason #2 NVM failures with core dumps on the failed generators.

Steps to Diagnose

Check the logs below:

1) $NS_WDIR/logs/TRXX/partition/ns_logs/ns_trace.log

2) $NS_WDIR/logs/TRXX/TestRunOutput.log

The test may stop, or NVMs may be killed, due to a core dump in a code function or the system kernel.
Check dmesg -T for segfaults. Also check for core files at /home/cavisson/core_files.
Get a backtrace using gdb and analyse the frames where the dump was created.

Commands to validate gdb
Solution Contact the Cavisson client support team.
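A backtrace can be taken non-interactively with gdb's batch mode; the binary and core file paths below are placeholders for illustration.

```shell
#!/bin/sh
# True only when gdb is installed and the core file exists.
can_backtrace() {
    command -v gdb > /dev/null 2>&1 && [ -f "$1" ]
}

BINARY=/home/cavisson/bin/netstorm        # assumption: path of crashed binary
CORE=/home/cavisson/core_files/core.1234  # placeholder core file name

if can_backtrace "$CORE"; then
    # -batch: exit after running commands; -ex: run a gdb command (full backtrace)
    gdb -batch -ex "bt full" "$BINARY" "$CORE"
else
    echo "gdb or core file not available"
fi
```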

Possible Reasons #3 Not enough space left on the controller/generators
Steps to Diagnose 1) Check using df -h.

2) It can also be checked using the nsu_server_admin command.

Commands to validate 1) df -h

2) nsu_server_admin -s ip -c “df -h”

Solution Free up disk space on the affected machine.
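The df check can be automated with a small sketch; the 90% threshold is an example value.

```shell
#!/bin/sh
# Warn when any filesystem is above the given usage threshold (percent).
check_disk_usage() {
    # expects POSIX `df -P` output: $5 = Use%, $6 = mount point
    awk -v limit="$1" 'NR > 1 { sub(/%/, "", $5); if ($5 + 0 > limit) print $6, "is", $5 "% full" }'
}

df -P 2>/dev/null | check_disk_usage 90
```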

Possible Reasons #4 Someone has stopped the test forcefully (Screen 18)
Steps to Diagnose Ping the generator IP.
Commands to validate ping ip
Solution Contact the CS team.

Possible Reasons #5 A generator went down (Screen 19)
Steps to Diagnose Check nsu_stop_test.log. It is present at
$NS_WDIR/logs/TRxx
Commands to validate vi $NS_WDIR/logs/TRxx/nsu_stop_test.log
Solution Restart the test if required.


Not able to start test due to shared memory issue

Possible Reasons #1 Insufficient shared memory on the NS/generator system
(Screen 20)
Steps to Diagnose 1) Run the command cat /proc/sys/kernel/shmmax.

2) The value must be greater than the buffer requested in the script.

3) On Cavisson cloud machines it is approximately 20 GB.

Commands to validate cat /proc/sys/kernel/shmmax
Solution Check and, if required, increase this value; changing it requires root access.
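The shmmax comparison can be sketched as below; the 20 GB requirement is the example value cited above, and in practice it should be the buffer size your script requests.

```shell
#!/bin/sh
# Compare shmmax against the shared memory the test needs.
shm_ok() {
    # awk handles values larger than the shell's integer range
    awk -v cur="$1" -v req="$2" 'BEGIN { exit !(cur + 0 >= req + 0) }'
}

REQUIRED=21474836480   # ~20 GB example; use the buffer size your script requests
CURRENT=$(cat /proc/sys/kernel/shmmax 2>/dev/null || echo 0)

if shm_ok "$CURRENT" "$REQUIRED"; then
    echo "shmmax is sufficient"
else
    echo "shmmax too small; as root: sysctl -w kernel.shmmax=$REQUIRED"
fi
```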


Unable to start test due to unknown host error

Possible Reasons #1 The DNS nameserver entry is missing from the resolv.conf file (Screen 25)
Steps to Diagnose 1) Check the file: cat /var/run/dnsmasq/resolv.conf

2) The entry will look like: nameserver 8.8.8.8

3) If the entry is not there, add it manually.

Commands to validate cat /var/run/dnsmasq/resolv.conf
Solution Keep a nameserver entry in the resolv.conf file.
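The nameserver check can be scripted as a sketch; the path is the dnsmasq resolv.conf from this guide, and 8.8.8.8 is the example resolver shown above.

```shell
#!/bin/sh
# True when the file contains at least one nameserver line.
has_nameserver() {
    grep -q '^nameserver' "$1" 2>/dev/null
}

RESOLV=/var/run/dnsmasq/resolv.conf
if has_nameserver "$RESOLV"; then
    echo "nameserver entry present"
else
    echo "no nameserver entry in $RESOLV"
    # add one manually (needs root), e.g.:
    # echo 'nameserver 8.8.8.8' >> /var/run/dnsmasq/resolv.conf
fi
```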

Possible Reasons #2 The host is not reachable from the source IP
Steps to Diagnose Check the host using the ping command or wget.
Commands to validate 1) ping hostname

2) wget hostname

Solution Get the source IP whitelisted at the host application.