Temporary Cluster Outage
Description
It is 10:56. Ganglia reports that only spg00 is up, and the front panel of the SPRAID is blinking. Nevertheless, condor_status shows:
[mdias@sprace mdias]$ ssh spgrid '. /OSG/setup.sh ;condor_status'
Name OpSys Arch State Activity LoadAv Mem ActvtyTime
vm1@node01.gr LINUX INTEL Unclaimed Idle 0.000 500 0+01:35:04
vm2@node01.gr LINUX INTEL Unclaimed Idle 0.000 500 1+01:35:43
vm1@node02.gr LINUX INTEL Unclaimed Idle 0.000 500 0+01:35:04
vm2@node02.gr LINUX INTEL Unclaimed Idle 0.000 500 1+01:35:41
vm1@node03.gr LINUX INTEL Unclaimed Idle 0.000 500 0+01:35:04
vm2@node03.gr LINUX INTEL Unclaimed Idle 0.000 500 1+01:35:39
vm1@node04.gr LINUX INTEL Unclaimed Idle 0.000 500 0+00:24:57
vm2@node04.gr LINUX INTEL Unclaimed Idle 0.000 500 1+01:35:41
vm1@node05.gr LINUX INTEL Unclaimed Idle 0.000 500 0+21:49:34
vm2@node05.gr LINUX INTEL Unclaimed Idle 0.000 500 0+01:35:05
vm1@node06.gr LINUX INTEL Unclaimed Idle 0.000 500 0+01:35:04
vm2@node06.gr LINUX INTEL Unclaimed Idle 0.000 500 1+01:35:39
vm1@node07.gr LINUX INTEL Unclaimed Idle 0.000 500 0+01:35:04
vm2@node07.gr LINUX INTEL Unclaimed Idle 0.000 500 1+01:35:40
vm1@node08.gr LINUX INTEL Unclaimed Idle 0.000 500 0+01:35:04
vm2@node08.gr LINUX INTEL Unclaimed Idle 0.000 500 1+01:35:37
vm1@node09.gr LINUX INTEL Unclaimed Idle 0.000 500 0+01:39:57
vm2@node09.gr LINUX INTEL Unclaimed Idle 0.000 500 0+01:35:05
vm1@node10.gr LINUX INTEL Unclaimed Idle 0.000 500 0+01:35:04
vm2@node10.gr LINUX INTEL Unclaimed Idle 0.000 500 1+01:35:40
vm1@node11.gr LINUX INTEL Unclaimed Idle 0.000 500 0+01:35:05
vm2@node11.gr LINUX INTEL Unclaimed Idle 0.000 500 1+01:35:36
vm1@node12.gr LINUX INTEL Unclaimed Idle 0.000 500 0+01:35:05
vm2@node12.gr LINUX INTEL Unclaimed Idle 0.000 500 1+01:35:39
vm1@node13.gr LINUX INTEL Unclaimed Idle 0.000 500 0+05:44:37
vm2@node13.gr LINUX INTEL Unclaimed Idle 0.000 500 0+01:35:05
vm1@node14.gr LINUX INTEL Unclaimed Idle 0.000 500 0+01:35:04
vm2@node14.gr LINUX INTEL Unclaimed Idle 0.000 500 1+01:35:40
vm1@node15.gr LINUX INTEL Unclaimed Idle 0.000 500 0+01:25:04
vm2@node15.gr LINUX INTEL Unclaimed Idle 0.000 500 1+01:25:35
vm1@node16.gr LINUX INTEL Unclaimed Idle 0.000 500 0+01:35:04
vm2@node16.gr LINUX INTEL Unclaimed Idle 0.000 500 1+01:35:35
vm1@node17.gr LINUX INTEL Unclaimed Idle 0.000 500 0+01:35:04
vm2@node17.gr LINUX INTEL Unclaimed Idle 0.000 500 1+01:35:39
vm1@node18.gr LINUX INTEL Unclaimed Idle 0.000 500 0+01:35:04
vm2@node18.gr LINUX INTEL Unclaimed Idle 0.000 500 1+01:35:37
vm1@node21.gr LINUX INTEL Unclaimed Idle 0.000 500 0+01:30:05
vm2@node21.gr LINUX INTEL Unclaimed Idle 0.000 500 1+01:30:41
vm1@node22.gr LINUX INTEL Unclaimed Idle 0.000 500 0+01:30:04
vm2@node22.gr LINUX INTEL Unclaimed Idle 0.000 500 1+01:30:38
vm1@node23.gr LINUX INTEL Unclaimed Idle 0.000 1003 0+01:30:04
vm2@node23.gr LINUX INTEL Unclaimed Idle 0.000 1003 1+01:30:35
vm1@spgrid.if LINUX INTEL Unclaimed Idle 1.000 1003 0+02:10:04
vm2@spgrid.if LINUX INTEL Unclaimed Idle 10.460 1003 1+02:10:53
Total Owner Claimed Unclaimed Matched Preempting Backfill
INTEL/LINUX 44 0 0 44 0 0 0
Total 44 0 0 44 0 0 0
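All 44 slots still report as Unclaimed even though Ganglia says the worker nodes are down; the Condor collector keeps ClassAds for several minutes after a startd stops reporting, so these entries may simply be stale. A quick cross-check, as a sketch assuming the same /OSG/setup.sh environment as above (CurrentTime and LastHeardFrom are standard Condor ClassAd attributes), is to keep only slots that reported recently:

[mdias@sprace mdias]$ ssh spgrid '. /OSG/setup.sh ; condor_status -constraint "(CurrentTime - LastHeardFrom) < 300"'

Any slot missing from that listing has not checked in within the last five minutes and is likely a stale ad.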
The nodes respond to ping:
[root@sprace:root]# ping node38
PING node38.cluster (192.168.1.38) from 192.168.1.200 : 56(84) bytes of data.
64 bytes from node38.cluster (192.168.1.38): icmp_seq=1 ttl=64 time=0.193 ms
64 bytes from node38.cluster (192.168.1.38): icmp_seq=2 ttl=64 time=0.190 ms
--- node38.cluster ping statistics ---
2 packets transmitted, 2 received, 0% loss, time 999ms
rtt min/avg/max/mdev = 0.190/0.191/0.193/0.013 ms
but ssh logins to them fail.
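To map which nodes answer ping but refuse ssh, a loop like the following helps; a minimal sketch, assuming the hostnames run node01..node38 (adjust the range to the real node list):

for i in $(seq -w 1 38); do
    ping -c 1 -w 2 node$i >/dev/null 2>&1 || { echo "node$i: no ping"; continue; }
    # BatchMode avoids hanging on a password prompt; ConnectTimeout bounds a dead sshd
    if ssh -o BatchMode=yes -o ConnectTimeout=5 node$i true >/dev/null 2>&1; then
        echo "node$i: ping OK, ssh OK"
    else
        echo "node$i: ping OK, ssh FAILED"
    fi
done

Nodes that ping but fail ssh usually point at sshd or a wedged disk/NFS mount on the node rather than at the network.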
On spraid:
[mdias@spraid mdias]$ df
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/sda2 2063536 599940 1358772 31% /
none 1027720 0 1027720 0% /dev/shm
/dev/sda7 1035660 34728 948324 4% /tmp
/dev/sda5 10317828 2196156 7597556 23% /usr
/dev/sda8 15346304 1444488 13122264 10% /usr/local
/dev/sda6 2063504 413860 1544824 22% /var
/dev/sdb1 1833096736 92955700 1647025088 6% /raid0
/dev/sdc1 1833096736 963934088 776046700 56% /raid1
/dev/sdd1 1730092600 264919452 1377289568 17% /raid2
/dev/sde1 1730092600 225076752 1417132268 14% /raid3
/dev/sdf1 1730092600 208326584 1433882436 13% /raid4
/dev/sdg1 1730092600 220788532 1421420488 14% /raid5
spdc00:/pnfsdoors 400000 80000 284000 22% /pnfs/if.usp.br
[mdias@spraid mdias]$ free
total used free shared buffers cached
Mem: 2055440 2038452 16988 0 581612 1175224
-/+ buffers/cache: 281616 1773824
Swap: 4192956 12724 4180232
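The Mem: line makes spraid look out of memory (only ~17 MB free), but most of the used memory is buffers and page cache; the "-/+ buffers/cache" row is the figure that matters. The same number can be recomputed from the fields above:

[mdias@spraid mdias]$ free | awk '/^Mem:/ {print $4 + $6 + $7, "kB effectively free"}'

With the values shown (16988 + 581612 + 1175224), that gives the 1773824 kB reported in the "-/+ buffers/cache" row, so memory pressure on spraid is not the problem.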
We are also listed as down at
http://cms-project-phedex.web.cern.ch/cms-project-phedex/cgi-bin/browser.
Updates
11:05. I did not touch anything and we are OK again in PhEDEx and in Ganglia, but the cluster is extremely unstable, with some nodes dropping from time to time (a monitoring sketch follows the listings below). The logs look fine. spg00 answers ping but has hit its utilization peak. I am logged in on spraid:
[mdias@spraid mdias]$ df -h
Filesystem Size Used Avail Use% Mounted on
/dev/sda2 2.0G 586M 1.3G 31% /
none 1004M 0 1004M 0% /dev/shm
/dev/sda7 1012M 34M 927M 4% /tmp
/dev/sda5 9.9G 2.1G 7.3G 23% /usr
/dev/sda8 15G 1.4G 13G 10% /usr/local
/dev/sda6 2.0G 405M 1.5G 22% /var
/dev/sdb1 1.8T 89G 1.6T 6% /raid0
/dev/sdc1 1.8T 920G 741G 56% /raid1
/dev/sdd1 1.7T 250G 1.3T 16% /raid2
/dev/sde1 1.7T 212G 1.4T 14% /raid3
/dev/sdf1 1.7T 198G 1.4T 13% /raid4
/dev/sdg1 1.7T 210G 1.4T 14% /raid5
spdc00:/pnfsdoors 391M 79M 278M 22% /pnfs/if.usp.br
[mdias@spraid mdias]$ free
total used free shared buffers cached
Mem: 2055440 2038660 16780 0 579184 1174588
-/+ buffers/cache: 284888 1770552
Swap: 4192956 12684 4180272
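Since nodes keep dropping and coming back, a simple availability logger can help correlate the drops with the RAID panel and PhEDEx symptoms; a rough sketch (hypothetical node range and log path, adjust as needed):

while true; do
    for i in $(seq -w 1 38); do
        ping -c 1 -w 2 node$i >/dev/null 2>&1 \
            || echo "$(date '+%F %T') node$i unreachable" >> /tmp/node-drops.log
    done
    sleep 60
done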