Earthworm statmgr replacement

At ROB, we don’t use Earthworm (ew) for real-time acquisition or monitoring, but we have installed it on our computer at the Kawah Ijen observatory to gather all seismic data streams on a single machine, and thus in a single archive. Every now and then, for no obvious reason, some ew modules crash and show up as “Zombie” or “Dead” in the output of the “status” command. I set up statmgr, but for some reason it never worked. So I wrote my own cron script that runs “status”, reads its output and restarts the crashed modules.

The cron job looks like this:

# m h  dom mon dow   command
* * * * * bash /home/seismo/cron_ew.sh >> /home/seismo/ew_restart.log 2>&1

Every minute of every hour of every day, bash runs the cron_ew.sh script, and its stdout and stderr are appended to ew_restart.log.

The content of cron_ew.sh is:

cd /home/seismo
. /home/seismo/ew/run_working/params/ew_linux.bash
source /home/seismo/.bashrc
python ijen_ew_restart.py

The second line sources the ew_linux.bash script, which sets all the environment variables ew needs to run. The third line might not be necessary, but it doesn’t hurt. The last line runs the Python script:

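from __future__ import print_function  # lets the script run under Python 2 or 3

import os
import subprocess

# Full path to the Earthworm "status" binary.
cmd = '/home/seismo/ew/earthworm_7.4/bin/status'
proc = subprocess.Popen(
    cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE
)

stdout, stderr = proc.communicate()
# Skip the 21 header lines and the trailing lines; what is left is the module table.
lines = stdout.decode('utf-8', 'replace').split('\n')[21:-2]

# Split each row on whitespace; split() with no argument drops empty fields.
elements = [line.split() for line in lines]

for element in elements:
    if len(element) < 3:
        continue  # not a module row
    # The first three columns are the module name, its pid and its status.
    name, pid, status = element[:3]
    if status == 'Dead' or status == 'Zombie':
        print('%s is %s, restarting %s' % (name, status, pid))
        os.system("restart %s" % pid)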

After some imports, the code executes the status command, gets its output and parses it (the hard way, not the clean way). Finally, it loops over the processes and restarts the Zombie or Dead ones.
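For what it's worth, here is a rough sketch of what a slightly cleaner parsing step could look like. Instead of slicing off a fixed number of header lines, it scans every line and keeps only the rows whose third column is “Dead” or “Zombie” and whose second column is a number. The function name find_failed_modules and the SAMPLE text are made up for illustration (the sample is not real status output), and the column order is assumed to be the one the script above relies on.

```python
def find_failed_modules(status_text, bad_states=("Dead", "Zombie")):
    """Return (name, pid, state) for every row whose state is in bad_states.

    Assumes the first three whitespace-separated columns are
    module name, pid and status, as in the script above.
    """
    failed = []
    for line in status_text.splitlines():
        fields = line.split()  # split on any run of whitespace
        if len(fields) < 3:
            continue  # blank line or something that is not a table row
        name, pid, state = fields[:3]
        # Header rows are skipped because their pid column is not a number.
        if state in bad_states and pid.isdigit():
            failed.append((name, int(pid), state))
    return failed


# Made-up sample for illustration only, NOT real Earthworm output.
SAMPLE = """\
  Name      Id   Status
  ------   ----  ------
  mod_a    1201  Alive
  mod_b    1202  Dead
  mod_c    1203  Zombie
"""

if __name__ == "__main__":
    for name, pid, state in find_failed_modules(SAMPLE):
        print("%s (pid %d) is %s" % (name, pid, state))
```

The upside is that the parser no longer depends on the exact number of header lines; the downside is that a module listed without a numeric pid would be silently ignored.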

Note: this version of the script doesn’t handle the case where ew has crashed completely. This could be detected by checking the length of the output of status.
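A minimal sketch of that length check, under assumptions: the helper name ew_looks_down is hypothetical, and the threshold of 22 lines is only a guess derived from the 21 header lines the script skips (21 header lines plus at least one module row). I have not checked what status actually prints when ew is down, so the threshold would need to be adjusted on a real system.

```python
def ew_looks_down(status_text, min_lines=22):
    """Guess whether Earthworm is down from the size of the status output.

    min_lines is an assumption: 21 header lines plus at least one module row.
    """
    return len(status_text.splitlines()) < min_lines


if __name__ == "__main__":
    # Made-up inputs: a short error-like message and a long table-like output.
    short_output = "some error message\n"
    long_output = "\n".join("row %d" % i for i in range(30))
    print(ew_looks_down(short_output))
    print(ew_looks_down(long_output))
```

In the cron script, this check would run right after communicate(), before the parsing loop, and could simply write a line to the log when it returns True.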

 
