srm Troubleshooting.
Description
Cleaning up a pool
Three disks from the same RAID went down. First we brought the pool back up and, before starting dCache, we removed the line
spraid02_1 /raid1/pool sticky=allowed recover-space recover-control recover-anyway lfs=precious tag.hostname=spraid02
from the file
/opt/d-cache/config/spraid02.poollist
In the administration interface we set the pool to disabled. From the dCache admin:
ssh -c blowfish -p 22223 -l admin localhost
(local) admin > cd PoolManager
(PoolManager) admin > psu set disabled spraid02_1
(PoolManager) admin > ..
(local) admin > logoff
We removed the pool from the database:
ssh -c blowfish -p 22223 -l admin localhost
> cd spraid02_1
> pnfs unregister
References: here
Low quality in PhEDEx transfers
The SPRACE -> T1_* transfers started to fail. Testing with FTS as described below:
export X509_USER_PROXY=/home/phedex/gridcert/proxy.cert;
PHEDEX_GLITE_ENV=/usr/local/glite/3.1.27-0/etc/profile.d/grid-env.sh;
source $PHEDEX_GLITE_ENV;
TIER1_FTS_SERVICE=USCMS-FNAL-WC1;
TIER1_FTS_SERVER=$(glite-sd-query -e -t org.glite.FileTransfer -s ${TIER1_FTS_SERVICE});
/home/phedex/sw/slc4_ia32_gcc345/cms/PHEDEX/PHEDEX_3_2_1/Utilities/ftscp -copyjobfile=/tmp/teste -passfile=/home/phedex/SITECONF/SPRACE/PhEDEx/ftspass -m=myproxy-fts.cern.ch -mode=multi -mapfile=/home/phedex/SITECONF/SPRACE/PhEDEx/fts.map
Reason: TRANSFER error during TRANSFER phase: [GRIDFTP_ERROR]
Too many open files
even though srm itself was working. Solution: go to the pools and restart them, but first run:
ulimit -n 32000
adding this line to
/opt/d-cache/bin/dcache
right after the start) and restart) cases, as sketched below.
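A minimal sketch of where that goes (the real /opt/d-cache/bin/dcache script is much larger; this only illustrates raising the limit inside the start and restart branches):
#!/bin/sh
# Sketch only: raise the per-process open-file limit before the dCache
# services are (re)started, to avoid the "Too many open files" GridFTP errors.
case "$1" in
  start|restart)
    ulimit -n 32000
    echo "file-descriptor limit is now $(ulimit -n)"
    ;;
esac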
Init Failed: File not found
One of the pools had problems, showing the message above at
http://osg-se.sprace.org.br:2288/usageInfo
because some pnfsIDs that it had recorded did not correspond to any data in the metadata (unlike orphan files, where the data exists but there is no pnfsID). Looking at the spraid02 log (the "File not found" entries in /var/log/spraid02Domain.log), it listed the problematic pnfsIDs. First it is necessary to note down which data are affected, so that their transfer to the farm can be requested again later (a sketch for scanning the whole log follows the single-ID example below):
. /usr/etc/pnfsSetup
export PATH=$PATH:$pnfs/tools
cd /pnfs/`hostname -d`/data
pathfinder 000100000000000000C7BA18
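A hedged sketch of the same lookup run over every pnfsID that appears in the domain log; the 24-hex-character pattern and the assumption that the log quotes the pnfsIDs verbatim are ours, so adjust the grep as needed:
. /usr/etc/pnfsSetup
export PATH=$PATH:$pnfs/tools
cd /pnfs/`hostname -d`/data
# collect candidate pnfsIDs from the "File not found" messages and resolve each one
grep -o '[0-9A-F]\{24\}' /var/log/spraid02Domain.log | sort -u | \
while read id; do
    echo "== $id =="
    pathfinder "$id"
done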
Then remove from spraid02 the corresponding /raid1/control/data/000100000000000000C7BA18. It was necessary to reboot the machine. Afterwards the procedure is to double-check: enter the SE administration interface and verify whether corresponding replicas still exist (or whether it really was removed):
ssh -c blowfish -p 22223 -l admin localhost
cd PnfsManager
cacheinfoof 000100000000000000C7BA18
..
logoff
It should not show any pool with that entry. Running
rep rm -force 000100000000000000C7BA18
directly in the administration interface did not work, because the pool would not come up to execute the command (there should be a less clumsy way, perhaps by going straight into PostgreSQL and removing the entry there; a hedged sketch follows).
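For illustration only (we did not test this): in dCache 1.9-era installs the replica locations are kept in the PnfsManager companion database, so the stale entry could in principle be deleted there directly. The database, table and user names below are assumptions; verify them against your own schema before removing anything.
# assumed names: database "companion", table "cacheinfo", user "srmdcache"
psql -U srmdcache companion -c \
  "DELETE FROM cacheinfo WHERE pnfsid='000100000000000000C7BA18' AND pool='spraid02_1';"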
Connection Timeout
Third-party transfers
srmcp srm://A srm://B
are failing (except to FNAL).
- Checked the firewall:
iptables -vL
iptables -I INPUT -p TCP --dport 2811 -m state --state NEW -j ACCEPT
iptables -I INPUT -p TCP --dport 20000:25000 -m state --state NEW -j ACCEPT
iptables -I INPUT -p udp --dport 20000:25000 -j ACCEPT
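As an extra sanity check (sketch only; the dCacheSetup key names below are an assumption for 1.9-era installs), confirm that the passive data-port range dCache publishes matches the 20000:25000 range opened above:
grep -iE 'portRange|clientDataPortRange' /opt/d-cache/config/dCacheSetup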
- Checked certificates (pools and servers)
openssl verify -CApath /etc/grid-security/certificates/ /etc/grid-security/hostcert.pem
- On our main server, osg-se, we increased the debug level (in srm.batch and gridftpdoor.batch):
set printout default 4
Investigating by looking at the srm log:
tail -f /opt/d-cache/libexec/apache-tomcat-5.5.20/logs/catalina.out
- Upgraded to dCache 1.9.0-10: it did not work.
Some parameters were changed:
- To avoid "connection timeout" on FTS transfers, it is necessary to change (on the server and on the pools):
vim /opt/d-cache/config/gridftpdoor.batch
set context -c performanceMarkerPeriod 10
/opt/d-cache/bin/dcache restart gridftp-spraid01Domain
This did not solve our main problem yet.
set context -c gsiftpAdapterInternalInterface 192.168.1.152
But this did not solve our main problem either.
Configuration removed: it works without it.
- Some srm changes in /opt/d-cache/config/srm.batch:
set context -c srmVacuum false
set context -c srmPutReqThreadPoolSize 500
set context -c srmCopyReqThreadPoolSize 500
set context -c srmGetLifeTime 28800000
set context -c srmPutLifeTime 28800000
set context -c srmCopyLifeTime 28800000
set context -c remoteGsiftpMaxTransfers 550
and restarted our admin node. No success.
Configuration removed: it works without it.
- Some tuning on all servers:
ifconfig eth0 txqueuelen 20000
Quality did not increase.
I had a look at cmswiki, where there were more details of the problem, which
was solved by using MyProxy instead of delegation.
The problem is that the current delegation library is not able to handle new
style proxy certificates, which are generated by default with
'grid-proxy-init'.
See https://savannah.cern.ch/bugs/index.php?34026
We rarely experience this problem, because we usually use
voms-proxy-init, which still generates old style proxy
certificates by default.
The workaround is to use
grid-proxy-init -old
One can reproduce the problem by generating an old style
proxy after a new style proxy:
$ grid-proxy-init
$ mv /tmp/x509up_u$(id -u) /tmp/grid-proxy
$ voms-proxy-init -cert /tmp/grid-proxy -key /tmp/grid-proxy
$ grid-proxy-info
subject : /DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=szamsu/CN=452476/CN=Akos Frohner/CN=201855275/CN=proxy
The problematic credential had similar DN (see CN=1234/CN=proxy):
/DC=org/DC=doegrids/OU=People/CN=Paul Rossman 364403/CN=117294575/CN=proxy
To implement this workaround we changed to delegation, not using myproxy in our FTS. You need to remove both the
-passfile
and the
-myproxy
options from the PhEDEx
ConfigPart.FTSDownload
configuration.
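Roughly what the download agent block looks like after the change. This is a sketch from memory rather than our actual file; the remaining option names and the variables are assumptions, so check them against the real ConfigPart.FTSDownload:
### AGENT LABEL=download-fts PROGRAM=Toolkit/Transfer/FileDownload DEFAULT=on
 -db         ${PHEDEX_DBPARAM}
 -nodes      ${PHEDEX_NODE}
 -backend    FTS
 -service    ${PHEDEX_FTS_SERVICE}
The point is simply that no -passfile and no -myproxy line remains, so the agent falls back to delegated credentials.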
No positive results.
- Testing whether srm is OK. Some transfers work fine:
srmcp -2 -debug=true srm://cmssrm.fnal.gov:8443/srm/managerv2?SFN=/11/store/PhEDEx_LoadTest07/LoadTest07_Debug_BR_SPRACE/US_FNAL/69/mediumfile.txt gsiftp://osg-se.sprace.org.br:2811//mdias/testes/mediumfile_from_fnal.gsiftp.osg-se.2 -protocols=gsiftp
srmcp -2 -debug=true srm://srm-cms.cern.ch:8443/srm/managerv2?SFN=/castor/cern.ch/cms/store/test/smale.txt srm://osg-se.sprace.org.br:8443/pnfs/sprace.org.br/data/mdias/testes/smallfile_from_cern.osgse
srmcp -2 -debug=true srm://gridka-dCache.fzk.de:8443/srm/managerv2?SFN=/pnfs/gridka.de/cms/test/mediufile.sh srm://osg-se.sprace.org.br:8443/srm/managerv2?SFN=/pnfs/sprace.org.br/data/mdias/testes/mediufile.txt
where
$ srmls srm://osg-se.sprace.org.br:8443/pnfs/sprace.org.br/data/mdias/testes/smallfile_from_cern.osgse
3033 /pnfs/sprace.org.br/data/mdias/testes/smallfile_from_cern.osgse
$ srmls srm://osg-se.sprace.org.br:8443/pnfs/sprace.org.br/data/mdias/testes/mediumfile_from_fnal.gsiftp.osg-se.2
616920 /pnfs/sprace.org.br/data/mdias/testes/mediumfile_from_fnal.gsiftp.osg-se.2
- Another try: in all gridftpdoor.batch files we changed these parameters
set context -c gsiftpPoolManagerTimeout 5400
set context -c gsiftpMaxRetries 80
to
set context -c gsiftpPoolManagerTimeout 3600
set context -c gsiftpMaxRetries 3
to reduce some of the load on our gridftp and pnfs servers.
- Going deeper into our gridftp logs: comparing a failed transfer with a successful one, each stream dies with
02/27 09:59:41,214 FTP Door: Transfer error. Sending kill to pool spraid01_3 for mover 11950
02/27 09:59:41 Cell(GFTP-osg-se-Unknown-114@gridftp-osg-seDomain) : CellMessage From : [>spraid01_3@spraid01Domain:*@spraid01Domain:PoolManager@dCacheDomain:*@dCacheDomain]
02/27 09:59:41 Cell(GFTP-osg-se-Unknown-114@gridftp-osg-seDomain) : CellMessage To : [*@dCacheDomain:PoolManager@dCacheDomain:*@gridftp-osg-seDomain:>GFTP-osg-se-Unknown-114@gridftp-osg-seDomain]
02/27 09:59:41 Cell(GFTP-osg-se-Unknown-114@gridftp-osg-seDomain) : CellMessage Object : (33)=Unexpected Exception : org.dcache.ftp.FTPException: Stream ended before EOD
02/27 09:59:41 Cell(GFTP-osg-se-Unknown-114@gridftp-osg-seDomain) :
02/27 09:59:41,359 FTP Door: Transfer error. Removing incomplete file 000100000000000000D4ACF8: /pnfs/sprace.org.br/data/mdias/testes/mediumfile_from_fnal.gsiftp.osg-se.2_trivial
02/27 09:59:41,451 FTP Door: Failed to delete 000100000000000000D4ACF8: Not in trash: 000100000000000000D4ACF8
02/27 09:59:41,452 FTP Door: Transfer error: 451 Aborting transfer due to session termination
Insufficient number of streams? Let's increase them in our gridftpdoor.batch files (pools and server):
set context -c gsiftpMaxStreamsPerClient 20 #10
set context -c gsiftpMaxLogin 300 #100
We also tried again to use our internal interface, to speed things up:
set context -c gsiftpAdapterInternalInterface 192.168.1.151 #was ""
set context -c gsiftpIoQueue WAN #was ""
and increased the Java memory in dCacheSetup:
java_options="-server -Xmx2048m -XX:MaxDirectMemorySize=2048m"  # was 512m
We also shut down our billing-to-database for the moment:
billingToDb=no
Network tuning
On each server/pool we kept our "default tuning" procedures:
$ more /etc/sysctl.conf
#Tunning
net.ipv4.tcp_window_scaling = 1
net.ipv4.tcp_timestamps = 0
# turns TCP timestamp support off, default 1, reduces CPU use
net.ipv4.tcp_syncookies = 1
net.ipv4.tcp_sack = 0
# turn SACK support off, default on
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 87380 16777216
vm.min_free_kbytes = 65536
vm.overcommit_memory = 2
(We used RTT * max_bandwidth * 1000 / 8 to estimate these numbers, with a maximum bandwidth of 1000 Mb/s and an RTT of 125 ms to FNAL; a worked example follows the command below.) These changes can be applied without a reboot:
$ sysctl -p
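The arithmetic behind the 16 MiB values, written out as a quick check (the 125 ms RTT and 1000 Mb/s figures are the ones quoted above):
# bandwidth-delay product: RTT(ms) * bandwidth(Mb/s) * 1000 / 8 = bytes in flight
RTT_MS=125
BW_MBPS=1000
echo $(( RTT_MS * BW_MBPS * 1000 / 8 ))   # 15625000 bytes, about 15 MiB
# rounded up to 16777216 (16 MiB) for net.core.rmem_max / net.core.wmem_max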
We also changed
$/sbin/ifconfig eth0 txqueuelen 10000
$/sbin/ifconfig eth1 txqueuelen 10000
Debugging our Network link
To start the analysis, we ran a simple gridftp transfer:
globus-url-copy -vb gsiftp://cmsstor89.fnal.gov:2811///WAX/11/store/PhEDEx_LoadTest07/LoadTest07_Prod_FNAL/LoadTest07_FNAL_B4 gsiftp://spraid01.sprace.org.br:2811//mdias/testes/fnal_test
Source: gsiftp://cmsstor89.fnal.gov:2811///WAX/11/store/PhEDEx_LoadTest07/LoadTest07_Prod_FNAL/
Dest: gsiftp://spraid01.sprace.org.br:2811//mdias/testes/
LoadTest07_FNAL_B4 -> fnal_test
3932160 bytes 0.13 MB/sec avg 0.13 MB/sec inst
Our traceroute (run from spraid01) is:
$ traceroute cmsstor89.fnal.gov
traceroute to cmsstor89.fnal.gov (131.225.205.211), 30 hops max, 38 byte packets
1 200.136.80.1 (200.136.80.1) 0.413 ms 0.358 ms 0.332 ms
2 143-108-254-241.ansp.br (143.108.254.241) 0.783 ms 0.738 ms 0.735 ms
3 143-108-254-50.ansp.br (143.108.254.50) 0.991 ms 0.948 ms 1.037 ms
4 ansp-whren-stm.ampath.net (198.32.252.229) 109.343 ms 109.513 ms 109.422 ms
5 max-ampath.es.net (198.124.194.5) 140.242 ms 147.058 ms 140.413 ms
Icmp checksum is wrong
6 clevcr1-ip-washcr1.es.net (134.55.222.57) 148.033 msIcmp checksum is wrong
147.962 msIcmp checksum is wrong
147.975 ms
Icmp checksum is wrong
7 chiccr1-ip-clevcr1.es.net (134.55.217.54) 157.163 msIcmp checksum is wrong
157.065 msIcmp checksum is wrong
157.060 ms
Icmp checksum is wrong
8 fnalmr1-ip-chiccr1.es.net (134.55.219.122) 158.444 msIcmp checksum is wrong
158.557 msIcmp checksum is wrong
158.798 ms
9 fnalmr2-ip-fnalmr3.es.net (134.55.41.42) 158.240 ms 158.397 ms 158.205 ms
10 te4-2-esnet.r-s-bdr.fnal.gov (198.49.208.230) 158.373 ms 158.360 ms 158.376 ms
11 131.225.15.201 (131.225.15.201) 158.558 ms 158.443 ms 158.439 ms
12 vlan608.r-s-hub-fcc.fnal.gov (131.225.102.3) 158.424 ms 158.318 ms 158.325 ms
13 s-cms-fcc2.fnal.gov (131.225.15.54) 159.239 ms 159.613 ms 159.146 ms
14 cmsstor89.fnal.gov (131.225.205.211) 159.448 ms 159.104 ms 158.964 ms
We checked for hardware problems, looking for errors, dropped packets, overruns, and frame or carrier failures in:
# ifconfig eth0
eth0 Link encap:Ethernet HWaddr 00:11:43:E5:06:3A
inet addr:200.136.80.6 Bcast:200.136.80.255 Mask:255.255.255.0
inet6 addr: fe80::211:43ff:fee5:63a/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:89324608 errors:0 dropped:0 overruns:0 frame:0
TX packets:107777879 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:10000
RX bytes:2694181555 (2.5 GiB) TX bytes:4063241173 (3.7 GiB)
Base address:0xecc0 Memory:df9e0000-dfa00000
Some additional changes were made (we checked the NIC ring parameters and enlarged the RX ring):
ethtool -g eth0
ethtool -G eth1 rx 4096
- Looking for packet loss:
We installed
bing
locally:
wget http://debian.inode.at/debian/pool/main/b/bing/bing_1.1.3.orig.tar.gz
tar -xvzf bing_1.1.3.orig.tar.gz
cd bing_1.1.3
make
su
Using the -S option; type Ctrl-C to stop:
./bing -S 100 200.136.80.5 cmssrm.fnal.gov
We see the same errors with
mtr cmssrm.fnal.gov -s 1000
Using the loop below we estimated our packet loss (from osg-se) and compared it with shell.ift.unesp.br:
for ((i=1024;i< 65507 ;i+=1024)); do export loss=`ping cmssrm.fnal.gov -c 20 -s $i |grep loss|cut -d' ' -f6`; echo $i $loss; done
The results were plotted in a graph.
We also checked our network card speed configuration:
ethtool eth0
Speed: 1000Mb/s
It may be that, due to our normal activities, this link is already saturated.
- Our throughput is limited by the slowest bandwidth between two routers in our path. Let's first check our path to a machine located at FNAL, using pathchar:
[root@osg-se mdias]# ./pathchar cmssrm.fnal.gov
pathchar to cmssrm.fnal.gov (131.225.207.12)
can't find path mtu - using 1500 bytes.
doing 32 probes at each of 45 sizes (64 to 1500 by 32)
0 localhost
| 341 Mb/s, 157 us (350 us)
1 200.136.80.1 (200.136.80.1)
| ?? b/s, 191 us (729 us)
2 143-108-254-241.ansp.br (143.108.254.241)
| 981 Mb/s, 83 us (0.91 ms), 3% dropped
3 143-108-254-50.ansp.br (143.108.254.50)
| 63 Mb/s, 54.2 ms (109 ms), 4% dropped
4 ansp-whren-stm.ampath.net (198.32.252.229)
| ?? b/s, 3.86 ms (148 ms), 16% dropped
and the same to FZK:
[root@osg-se mdias]# ./pathchar gridka-dCache.fzk.de
pathchar to gridka-dCache.fzk.de (192.108.45.38)
can't find path mtu - using 1500 bytes.
doing 32 probes at each of 45 sizes (64 to 1500 by 32)
0 osg-se (200.136.80.5)
| 278 Mb/s, 157 us (356 us)
1 200.136.80.1 (200.136.80.1)
| ?? b/s, 190 us (727 us)
2 143-108-254-241.ansp.br (143.108.254.241)
| ?? b/s, 104 us (0.93 ms), 6% dropped
3 143-108-254-50.ansp.br (143.108.254.50)
| 56 Mb/s, 54.2 ms (109 ms), 7% dropped
4 ansp-whren-stm.ampath.net (198.32.252.229)
| ?? b/s, 6.98 ms (123 ms), 7% dropped
ERROR FOUND
It was a problem in our optical link (found by Allan who, paying attention to the LED indicators, noticed a lot of CRC errors).