PhEDEx and OSG down.
Description
It is 07:44 and our PhEDEx Production Component Status
has been down for 10h25min. The OSG site also returned the following results from its tests:
Authentication: Pass 2006-10-09 09:04:29 GMT
Hello World: Fail
Command:
globus-job-run spgrid.if.usp.br:2119 /bin/sh -c "echo Hello World ; echo Hello_World_DONE"
Reason:
Timeout ; output : /usr/local/globusc/globus/bin/globus-job-run: line 1: 18198 Killed /usr/local/globusc/globus/bin/globusrun -q -o -r "spgrid.if.usp.br:2119" -f /tmp/globus_job_run.osggridcat.rsl.18125 ; status : 246 2006-10-09 09:06:09 GMT
CONDOR Batch System:
-Batch Query: Pass 2006-10-09 09:04:35 GMT
-Batch Sub: Pass 2006-10-09 09:04:35 GMT
-Batch Cancel: Fail
Command:
globus-job-clean -force -r spgrid.if.usp.br:2119/jobmanager-fork https://spgrid.if.usp.br:_port_range_port_/number1/number2
Reason:
Unknown ; output: Could not clean up job. ; status: 245 2006-10-09 09:04:36 GMT
gsiftp: Pass 2006-10-09 09:06:14 GMT
Web Service Hello World: Pass 2006-10-09 09:07:17 GMT
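To pick out the failing tests from a report like the one above, a small filter can help. This is only a sketch: it assumes the report has been saved as plain text in the same layout as pasted here, and `failed_tests` is a hypothetical helper name, not part of any OSG tool.

```shell
# Hypothetical helper: read an OSG test report on stdin and print the
# name of every test whose result line ends in ": Fail".
failed_tests() {
  # Split each line on ": "; drop the leading "-" used by the batch sub-tests.
  awk -F': ' '/: Fail[[:space:]]*$/ { sub(/^-/, "", $1); print $1 }'
}

# Usage (report.txt is an assumed file name):
#   failed_tests < report.txt
```

For the report above, this would list the Hello World and Batch Cancel tests, which are the two marked Fail.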
Updates
I am going to restart the PhEDEx service. As far as I can tell, the grid proxy is valid:
[root@spdc00 root]# su - phedex
[phedex@spdc00 phedex]$ grid-proxy-info
subject : /DC=org/DC=doegrids/OU=People/CN=Eduardo Gregores 407221/CN=proxy/CN=proxy/CN=proxy
issuer : /DC=org/DC=doegrids/OU=People/CN=Eduardo Gregores 407221/CN=proxy/CN=proxy
identity : /DC=org/DC=doegrids/OU=People/CN=Eduardo Gregores 407221
type : full legacy globus proxy
strength : 1024 bits
path : /home/phedex/gridcert/proxy.cert
timeleft : 11:16:16
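The proxy check above was done by eye. A minimal sketch of the same check in script form follows, assuming the `grid-proxy-info` output keeps the `timeleft : H:M:S` layout shown here; the function names and the threshold are hypothetical.

```shell
# Hypothetical helper: convert the "timeleft : H:M:S" line of
# grid-proxy-info output (read on stdin) into seconds.
proxy_timeleft_seconds() {
  awk '$1 == "timeleft" { split($NF, t, ":"); print t[1] * 3600 + t[2] * 60 + t[3] }'
}

# Hypothetical helper: succeed only if at least $1 seconds remain.
proxy_has_at_least() {
  left=$(proxy_timeleft_seconds)
  [ -n "$left" ] && [ "$left" -ge "$1" ]
}

# Usage (the 3600 second threshold is an arbitrary example):
#   grid-proxy-info | proxy_has_at_least 3600 || echo "proxy expires soon"
```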
So I stopped and started the Prod agents:
[phedex@spdc00 phedex]$ Master -config ~/SITECONF/local/PhEDEx/Config.Prod stop
[phedex@spdc00 phedex]$ Master -config ~/SITECONF/local/PhEDEx/Config.Prod start
FileDownload: pid 29035 already running in /home/phedex/state/download-master-prod
FileDiskExport: pid 29041 already running in /home/phedex/state/exp-disk-prod
InfoDropStatus: pid 29047 already running in /home/phedex/state/info-ds-prod
FilePFNExport: pid 29053 already running in /home/phedex/state/exp-pfn-prod
Even so, at 08:27 the service still had not come back as UP. I restarted it again.
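The start output above reports every agent as "already running", which suggests the old processes had not yet exited when `start` was issued. One possible way to avoid that, sketched here under that assumption, is to wait for the old pids to disappear before starting again. `wait_for_exit` is a hypothetical helper; the pid in the usage comment is taken from the output above.

```shell
# Hypothetical helper: wait until process $1 has exited, polling once per
# second for at most $2 seconds.
# Returns 0 once the process is gone, 1 if it is still alive at the timeout.
# Note: kill -0 only probes for existence; it must be run as a user allowed
# to signal the process (here, phedex).
wait_for_exit() {
  pid=$1
  timeout=$2
  waited=0
  while kill -0 "$pid" 2>/dev/null; do
    [ "$waited" -ge "$timeout" ] && return 1
    sleep 1
    waited=$((waited + 1))
  done
  return 0
}

# Usage (60 seconds is an arbitrary example timeout):
#   Master -config ~/SITECONF/local/PhEDEx/Config.Prod stop
#   wait_for_exit 29035 60 && Master -config ~/SITECONF/local/PhEDEx/Config.Prod start
```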
[phedex@spdc00 phedex]$ tail -n 10 /home/phedex/logs/download-master
2006-09-30 22:01:30: FileDownload[6579]: xstats: to=T2_SPRACE_Buffer from=T1_CERN_Load fileid=3610 state=100 size=2074217787 time_assigned=3856.96 time_all=2835.94 time_preclean=0.22 time_transfer=594.32 time_validate=2223.72 time_postclean=7.23 lfn=/store/test/2006/06/16/IntegrationLargeSample/0000/LoadTest_T1_CERN_0070 from_pfn=srm://srm.cern.ch:8443/srm/managerv1?SFN=/castor/cern.ch/cms/store/test/2006/06/16/IntegrationLargeSample/0000/LoadTest_T1_CERN_0070 to_pfn=srm://spdc00.if.usp.br:8443/srm/managerv1?SFN=/pnfs/if.usp.br/data/cms/store/test/2006/06/16/IntegrationLargeSample/0000/LoadTest_T1_CERN_0070
2006-09-30 22:01:31: FileDownload[6579]: Stopped all pending jobs
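Note that the last entries in this log are dated 2006-09-30, days before the incident. A rough way to measure how stale a log is can be sketched as follows, assuming GNU `date` (for the `-d` option) and that the log timestamp and the reference time are in the same timezone; `log_line_age_seconds` is a hypothetical helper name.

```shell
# Hypothetical helper: seconds between the timestamp at the start of a
# PhEDEx log line ($1) and a reference time ($2, "YYYY-MM-DD HH:MM:SS").
# Both are parsed as UTC so that only their difference matters.
log_line_age_seconds() {
  stamp=$(printf '%s\n' "$1" | cut -c1-19)   # e.g. "2006-09-30 22:01:31"
  then_s=$(date -u -d "$stamp" +%s) || return 1
  now_s=$(date -u -d "$2" +%s) || return 1
  echo $((now_s - then_s))
}

# Usage:
#   log_line_age_seconds "$(tail -n 1 /home/phedex/logs/download-master)" \
#                        "$(date '+%Y-%m-%d %H:%M:%S')"
```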
The gatekeeper log on spgrid, checked because of the problems with the OSG monitoring, shows:
[mdias@spgrid mdias]$ tail -f /OSG/globus/var/globus-gatekeeper.log
PID: 25742 -- Notice: 5: and local gid: 524
TIME: Mon Oct 9 08:28:32 2006
PID: 25742 -- Notice: 0: executing /usr/local/opt/OSG/globus/libexec/globus-job-manager
TIME: Mon Oct 9 08:28:32 2006
PID: 25742 -- Notice: 0: GATEKEEPER_JM_ID 2006-10-09.08:28:32.0000025742.0000000000 for /DC=org/DC=doegrids/OU=People/CN=Leigh Grundhoefer (GridCat) 693100 on 129.79.4.64
TIME: Mon Oct 9 08:28:32 2006
PID: 25742 -- Notice: 0: GRID_SECURITY_CONTEXT_FD=11
TIME: Mon Oct 9 08:28:32 2006
PID: 25742 -- Notice: 0: Child 25771 started
sh: line 1: /var/tmp/gratia.log: Permission denied
This looks normal. I will try restarting SC4:
[phedex@spdc00 phedex]$ Master -config ~/SITECONF/local/PhEDEx/Config.SC4 start
FileDownload: removing old stop flag /home/phedex/state/download-master/stop
FileDownload: pid 19841 started in /home/phedex/state/download-master
FileDiskExport: removing old stop flag /home/phedex/state/exp-disk/stop
FileDiskExport: pid 19847 started in /home/phedex/state/exp-disk
InfoDropStatus: removing old stop flag /home/phedex/state/info-ds/stop
InfoDropStatus: pid 19853 started in /home/phedex/state/info-ds
FilePFNExport: removing old stop flag /home/phedex/state/exp-pfn/stop
FilePFNExport: pid 19859 started in /home/phedex/state/exp-pfn
FileRecycler: removing old stop flag /home/phedex/state/download-recycle/stop
[phedex@spdc00 phedex]$ FileRecycler: pid 19865 started in /home/phedex/state/download-recycle
[phedex@spdc00 phedex]$ Master -config ~/SITECONF/local/PhEDEx/Config.Prod start
FileDownload: pid 29035 already running in /home/phedex/state/download-master-prod
FileDiskExport: pid 29041 already running in /home/phedex/state/exp-disk-prod
InfoDropStatus: pid 29047 already running in /home/phedex/state/info-ds-prod
FilePFNExport: pid 29053 already running in /home/phedex/state/exp-pfn-prod
[phedex@spdc00 phedex]$ tail -n 20 /home/phedex/logs/download-master
2006-09-30 22:01:31: FileDownload[6579]: Stopped all pending jobs
2006-10-09 11:35:54: FileDownload[19841]: (re)connecting to database
UPDATE
Eduardo solved the problem.