Quick *nix (mostly Linux) Health Checks #1

Post date: Apr 12, 2013 3:24:14 PM

Once upon a time, I spent a whole day trying to convince a junior system administrator to run lsof and tell me the size of the largest open file, because my app was performing really badly and the file system kept filling up no matter what.

Since he was clueless about what lsof does and why an insanely large file ruins your performance, he just ignored me until things got so ugly that the issue was escalated and his boss gave me the lsof results (yeah, I could not just become root and do it myself; large corporations have these separation-of-duties rules).

I identified the evil process and killed it, restoring the app's performance. Kudos!

So, kids, run lsof whenever your application goes weird. It may be stuck in a loop, writing huge amounts of data to disk through an open file descriptor!

Run it as root, otherwise lsof cannot fully inspect processes that belong to other users:

# 1 - Check the top 30 processes by number of open files:

sudo lsof | awk '$5 == "REG" {freq[$2]++ ; names[$2] = $1 ;} END {for (pid in freq) print freq[pid], names[pid], pid ; }' | sort -n -r -k 1,1 | head -30

Get used to your app's regular number of open files and watch out for any unusually large number!
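If you only care about one process, a lighter way to track that baseline is to count the entries under /proc/<pid>/fd. This is just a minimal sketch, assuming Linux with /proc mounted. The function names, the PROC_ROOT override (there only so the functions can be tried against a fake directory) and the baseline you pass in are all made up for illustration. Also note that it counts every descriptor (sockets and pipes included), not only the regular files the awk filter above keeps.

```shell
# Count the file descriptors a process holds open (Linux only).
# Reading another user's /proc/<pid>/fd needs root, just like lsof.
# PROC_ROOT is a hypothetical override used for testing; it defaults to /proc.
count_open_fds() {
    ls "${PROC_ROOT:-/proc}/$1/fd" 2>/dev/null | wc -l | tr -d ' '
}

# Compare the current count against a baseline you picked yourself.
warn_if_above() {
    pid=$1
    limit=$2
    n=$(count_open_fds "$pid")
    if [ "$n" -gt "$limit" ]; then
        echo "WARNING: pid $pid has $n open descriptors (baseline $limit)"
    else
        echo "OK: pid $pid has $n open descriptors"
    fi
}
```

For example, "warn_if_above 22209 500" run from cron would flag the process once it goes past 500 descriptors.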

# 2 - Check open file sizes, sorted by size, descending:

sudo lsof -s | awk '$5 == "REG"' | sort -n -r -k 7,7 | head -n 30

Get used to the usual size of your app's open files. Of course this varies a LOT from app to app, but after taking care of an app for some time you get used to its BAU (business as usual) parameters. Also, if you see a 5,000 GB file and your app is not launching any nuclear missile, hey, c'mon...
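Raw byte counts are hard to eyeball, so here is a small sketch that wraps the same pipeline and prints sizes in MB. It assumes lsof's default column order (COMMAND PID USER FD TYPE DEVICE SIZE NODE NAME) and file names without spaces. The function name largest_open_files is my own invention, not something lsof ships with.

```shell
# Read `lsof -s` output on stdin and print the largest regular files,
# with the size converted from bytes to MB.
# Assumes columns: COMMAND PID USER FD TYPE DEVICE SIZE NODE NAME
# and no spaces in file names (a name with spaces would be cut short).
# Optional argument: how many lines to show (default 30).
largest_open_files() {
    awk '$5 == "REG" { printf "%.1f MB\t%s\t%s\t%s\n", $7 / 1048576, $1, $2, $9 }' |
        sort -n -r -k 1,1 |
        head -n "${1:-30}"
}
```

Usage would be something like "sudo lsof -s | largest_open_files 10". A file opened by several processes shows up once per process, same as in the original one-liner.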

(One liners stolen from http://thegoogleof.blogspot.com/2011/11/lsof-sort.html - I could do it if I want! Anytime! :P)

Now let's talk memory. The free command is your friend, a complicated friend but still a friend.

[root@xyz ~]# free -m

             total       used       free     shared    buffers     cached
Mem:          7596       5475       2121          0         21        532
-/+ buffers/cache:       4921       2675
Swap:         8191       1138       7053

-m displays the values in megabytes.

So, I have 2,121 MB free, right? Wrong. I have 2,675 MB. And the system is really using 4,921 MB, not 5,475 MB (because that figure includes buffers and disk cache, which are freed whenever your apps need RAM).
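The arithmetic behind the "-/+ buffers/cache" line can be sketched in a few lines of awk. This assumes the older free layout shown above (total, used, free, shared, buffers, cached); newer versions of free print different columns, so the field numbers would not fit there. The function name real_memory is made up for this example.

```shell
# Read old-style `free -m` output on stdin and redo the buffers/cache math.
# Fields on the Mem: line: $2 total, $3 used, $4 free, $5 shared,
# $6 buffers, $7 cached (assumed layout, see above).
real_memory() {
    awk '$1 == "Mem:" {
        printf "used by apps: %d MB\n", $3 - $6 - $7
        printf "available: %d MB\n", $4 + $6 + $7
    }'
}
```

Fed with the Mem: line above, this gives 4922 and 2674. The numbers are off by one from what free printed because free does the math in kilobytes and rounds afterwards.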

Of course it is a bit more complicated than that, because the kernel also holds reclaimable slab memory, which the 'free' command ignores. For a really good post on memory, read here:


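To fold the reclaimable slab into the estimate, one option is to read /proc/meminfo directly. A rough sketch, assuming a Linux kernel that exposes the MemFree, Buffers, Cached and SReclaimable fields; the function name reclaimable_mb is hypothetical. Treat the result as an approximation, since not everything counted under Cached can actually be dropped.

```shell
# Add up memory the kernel can usually hand back to applications:
# free pages + buffers + page cache + reclaimable slab.
# Takes a meminfo-format file as argument (default: /proc/meminfo).
# Prints the total in MB, rounded down.
reclaimable_mb() {
    awk '$1 == "MemFree:" || $1 == "Buffers:" || $1 == "Cached:" || $1 == "SReclaimable:" { kb += $2 }
         END { printf "%d\n", int(kb / 1024) }' "${1:-/proc/meminfo}"
}
```

Calling "reclaimable_mb" with no argument reads the live /proc/meminfo.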
If you need specifics on a process (Linux), get its PID, for example:

[root@xyz ~]# ps aux | grep java | grep -v grep

joe 22209 3.3 1.8 954764 143336 ? Sl 07:56 8:39 /opt/ibm/lotus/notes

Then cat its info from /proc:

[root@xyz ~]# cat /proc/22209/status

Name: notes2

State: S (sleeping)

Tgid: 22209

Pid: 22209

PPid: 1

TracerPid: 0

Uid: 500 500 500 500

Gid: 500 500 500 500

Utrace: 0

FDSize: 1024

Groups: 18 498 500 502

VmPeak: 1028952 kB

VmSize: 954764 kB

VmLck: 0 kB

VmHWM: 228884 kB

VmRSS: 143336 kB

VmData: 601076 kB

VmStk: 100 kB

VmExe: 16 kB

VmLib: 176108 kB

VmPTE: 868 kB

VmSwap: 52096 kB

Threads: 103

SigQ: 1/60589

SigPnd: 0000000000000000

ShdPnd: 0000000000000000

SigBlk: 0000000000300000

SigIgn: 0000000000301000

SigCgt: 20000001c20864ff

CapInh: 0000000000000000

CapPrm: 0000000000000000

CapEff: 0000000000000000

CapBnd: ffffffffffffffff

Cpus_allowed: ff

Cpus_allowed_list: 0-7

Mems_allowed: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000001

Mems_allowed_list: 0

voluntary_ctxt_switches: 441126

nonvoluntary_ctxt_switches: 26452

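Of course there's a lot of info in there you don't need. VmPeak is a good one to know: it is the largest virtual memory size the process has reached. VmHWM (the "high water mark") is the matching peak for resident memory, that is, the most physical RAM it has held at once.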
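To cut the noise, you can pull out just the memory lines. A minimal sketch, assuming the status layout shown above (field name, then a value in kB); the function name vm_summary and the choice of fields are mine.

```shell
# Print only the peak/current memory fields from a /proc/<pid>/status file,
# converted from kB to MB (rounded down).
# Argument: path to the status file, e.g. /proc/22209/status
vm_summary() {
    awk '$1 ~ /^Vm(Peak|HWM|RSS|Swap):$/ { printf "%s %d MB\n", $1, int($2 / 1024) }' "$1"
}
```

For the process above, "vm_summary /proc/22209/status" would print four lines: VmPeak, VmHWM, VmRSS and VmSwap.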

For a detailed description of these fields: