adm.cluster and storage.grid down
Description
07h25 and Ganglia reports that adm.cluster and storage.grid went down 05h15 min ago. The load on spg00 hit 318! Nara shows a traffic peak of 30Mb/s from 05h to 06h. GFTP-spraid, SRM-spraid and spraid_# are offline as far as dcache is concerned.
[root@sprace:root]#ping spraid
PING spraid.if.usp.br (200.136.80.5) from 200.136.80.3 : 56(84) bytes of data.
64 bytes from spraid.if.usp.br (200.136.80.5): icmp_seq=1 ttl=64 time=1.09 ms
64 bytes from spraid.if.usp.br (200.136.80.5): icmp_seq=2 ttl=64 time=0.173 ms
--- spraid.if.usp.br ping statistics ---
2 packets transmitted, 2 received, 0% loss, time 1004ms
rtt min/avg/max/mdev = 0.173/0.634/1.095/0.461 ms
But it does not respond to ssh. Everything else is UP.
condor_status
on spgrid reports all machines ok. SPRace is ok (I am typing on it).
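The "answers ping but not ssh" state seen above can be checked from another machine with a short script. This is only a sketch: the function names `tcp_port_open`, `classify` and `check_host` are hypothetical, port 22 is assumed for ssh, and the probe relies on bash's `/dev/tcp` redirection and the coreutils `timeout` command being available.

```shell
#!/usr/bin/env bash
# Exit 0 if a TCP connection to host:port opens within the timeout (seconds).
tcp_port_open() {
    local host=$1 port=$2 secs=${3:-3}
    timeout "$secs" bash -c "exec 3<>/dev/tcp/$host/$port" 2>/dev/null
}

# Turn two exit statuses (ping, ssh port probe) into a label.
classify() {
    local ping_rc=$1 ssh_rc=$2
    if [ "$ping_rc" -ne 0 ]; then
        echo "down"
    elif [ "$ssh_rc" -ne 0 ]; then
        echo "ping-only"
    else
        echo "up"
    fi
}

# Probe a host and print "<host>: <label>".
check_host() {
    local host=$1 ping_rc=0 ssh_rc=0
    ping -c 2 "$host" >/dev/null 2>&1 || ping_rc=$?
    tcp_port_open "$host" 22 || ssh_rc=$?
    echo "$host: $(classify "$ping_rc" "$ssh_rc")"
}
```

Called as `check_host spraid`, a host in the state described here would be expected to come out as `ping-only`, which is a hint that the kernel is still alive while user-space services are starved.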
Updates
07h53. SPRaid is back up. Unfortunately, I logged in at the console but it was useless: the flood of log messages
Out of memory: process java ...
made any operation on any terminal impossible, so I forced a hard reboot and
fsck
took care of the rest.
ntpd
did not come up OK.
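Once the machine is back, the size of the "Out of memory" flood can be estimated from the system log. A minimal sketch, assuming the kernel messages were written to a syslog file before the reboot; the function name `count_oom` is hypothetical and `/var/log/messages` is only a guess at the log location on this host.

```shell
#!/usr/bin/env bash
# Print how many lines of the given log file contain "Out of memory".
# grep -c exits non-zero when the count is 0, so the status is ignored.
count_oom() {
    local logfile=$1
    grep -c 'Out of memory' "$logfile" || true
}
```

For example, `count_oom /var/log/messages` (path assumed) gives a rough idea of how long the java process had been hitting the memory limit.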
Now, on sprace
[root@spraid root]# /opt/d-cache/bin/dcache-core start
/pnfs/if.usp.br/ not mounted - going to mount it now ...
Starting dcache services:
Starting gridftp-spraidDomain 6 5 4 3 2 1 0 Done (pid=1133)
Starting srm-spraidDomain 6 5 4 3 2 1 0 Done (pid=1223)
[root@spraid root]# /opt/d-cache/bin/dcache-pool start
Starting dcache pool: Starting spraidDomain 6 5 4 3 2 1 0 Done (pid=1400)
[root@spraid root]# /etc/init.d/ntpd restart
Shutting down ntpd: [ OK ]
ntpd: Synchronizing with time server: [ OK ]
Starting ntpd: [ OK ]
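The manual sequence above could be wrapped in a small script for the next time. This is a sketch, not an official dcache procedure: the commands and their order are copied from the session above, while the `run` and `recover_spraid` functions and the `DRY_RUN` switch are hypothetical additions for rehearsing the steps without touching the services.

```shell
#!/usr/bin/env bash
# Run a command, or only print it when DRY_RUN=1 (hypothetical rehearsal switch).
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Same order as in the session above: core services, then the pool, then ntpd.
# The chain stops at the first command that fails.
recover_spraid() {
    run /opt/d-cache/bin/dcache-core start &&
    run /opt/d-cache/bin/dcache-pool start &&
    run /etc/init.d/ntpd restart
}
```

Running `DRY_RUN=1 recover_spraid` only lists the three commands; without `DRY_RUN` it executes them as root on spraid.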
It took a while, but the monitoring at
http://spdc00.if.usp.br:2288/cellInfo
reported ok for spraid_1, spraid_2, etc.
Fulano on dd/mm/yyyy