Brain Phrye

Monitoring cron jobs

Cron jobs sometimes fail, and the old way of getting emails from the cron daemon doesn’t really scale. For instance, you might have a job that fails from time to time and that’s ok, but if it keeps failing for too long it’s a problem. Generally, email as an alerting tool is a Bad Thing and should be avoided.

Since I have Prometheus set up for everything, the easiest thing is to use the textfile collector from node exporter to dump some basic stats. And since I don’t want to write this over and over, I wrap most cron jobs in the script at the end of this post.

For most cron jobs I want four things: the exit code, when it last ran, how long it ran for, and a guarantee that only one instance runs at a time. The first three are exported to Prometheus as metrics; for the last one I use flock.
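As a minimal sketch of the "one instance at a time" part, assuming flock from util-linux is installed: the helper name run_once and the exit code 75 are made up for illustration and are not part of the script at the end of this post.

```shell
#!/bin/bash
# Sketch: run a command only if no other instance holds the lock.
# run_once is a hypothetical helper name used for illustration.
run_once() {
  local lockfile="$1"
  shift
  # -n: fail immediately instead of waiting when the lock is already held.
  # -E 75: the exit code flock returns in that case (arbitrary choice here).
  flock -n -E 75 "$lockfile" "$@"
}

# Example use (hypothetical lock path and job):
#   run_once /tmp/demo.lock /path/to/job arg1 arg2
```

The runner script in this post passes -E 0 instead, so an overlapping run that gets skipped is recorded as a success rather than as a failure.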

Lastly, when things go wrong there are two things I want to do: rerun a failed job after I’ve fixed something, and see why the job failed. For the latter, I pipe all output from the job to logger, which tags each log line with a given prefix to make it easier to find.
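Piping a job into logger has one catch: the pipeline's exit code is logger's, not the job's. A rough sketch of how bash's PIPESTATUS gets around that is below. The helper run_tagged is hypothetical, and awk stands in for logger -t so the example prints to the terminal instead of writing to syslog.

```shell
#!/bin/bash
# Sketch: send a command's stdout and stderr through a tagging filter while
# still recovering the command's own exit code.
# run_tagged is a hypothetical helper; awk is a stand-in for `logger -t "$tag"`.
run_tagged() {
  local tag="$1"
  shift
  "$@" 2>&1 | awk -v tag="$tag" '{ print tag ": " $0 }'
  # PIPESTATUS[0] is the exit code of the first command in the pipeline.
  # Plain $? would report the exit code of awk instead.
  return "${PIPESTATUS[0]}"
}

# Example use:
#   run_tagged demo sh -c 'echo hello; exit 3'
```

PIPESTATUS is overwritten by the next command that runs, so it has to be read straight after the pipeline.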

For both of these scenarios the runner-log script generates two scripts: /tmp/rerun-$job and /tmp/check-$job. In terms of security it would probably be better if these were written somewhere other than a world-writable directory like /tmp, but this works for now. Referencing those scripts in the alert text makes it a bit faster to debug.
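The generated rerun script embeds the job's arguments as plain text, so arguments containing spaces or quotes do not survive. One possible way around that, which is not what runner-log does, is bash's printf %q. The helper quote_args below is a hypothetical name, and embedding its output in a generated script would still need care with any surrounding quotes.

```shell
#!/bin/bash
# Sketch: turn an argument list into one string that bash can safely re-read.
# quote_args is a hypothetical helper, not part of runner-log.
quote_args() {
  local quoted=""
  local arg
  for arg in "$@"; do
    # %q escapes spaces, quotes and other shell metacharacters.
    quoted+="$(printf '%q' "$arg") "
  done
  # Strip the single trailing separator space.
  printf '%s\n' "${quoted% }"
}

# Example use:
#   quote_args "two words" plain
```

Re-reading the result with eval "set -- $(quote_args "$@")" gives back the original argument list.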

I put runner-log and all the jobs it calls in /etc/cron.d/scripts. It just makes it easier to find “all things cron” inside of /etc/cron.d rather than scattered around the file system in possibly “more correct” locations.

One last note: I use batch in the rerun script, and in more and more places as time goes on, because it solves a number of issues. It queues the command and runs it via the at daemon once the system load is low enough.

#!/bin/bash

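start=$(date +%s)
# Jobs live in the same directory as this script.
textfile_dir=$(dirname "$0")
job="$1"
log_prefix="$2"
# Drop the job name and log prefix; whatever is left is passed to the job.
shift; shift
if [[ -z "$job" || -z "$log_prefix" ]]; then
  echo "ERROR: Missing arguments." >&2
  exit 1
fi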
script="$textfile_dir/$job"
runner_prom_file="/var/lib/node_exporter/textfile_collector/runner_${job%.*}".prom

# Record how to check and/or rerun this script.
# Note that if the script takes complex arguments this won't work. $* and $@
# are for all purposes indistinguishable in a heredoc.
cat > "/tmp/rerun-$job" << EOF
#!/bin/bash

echo "/etc/cron.d/scripts/runner-log $job $log_prefix $*" | sudo -u $(whoami) batch
EOF
cat > "/tmp/check-$job" << EOF
#!/bin/bash

grep "$log_prefix" /var/log/syslog \
  || grep "$log_prefix" /var/log/syslog.1
EOF
chmod 755 "/tmp/rerun-$job" "/tmp/check-$job"

# Run the script.
if [[ ! -x "$script" || -d "$script" ]]; then
  echo "ERROR: Can't find script for '$job'. Aborting." \
    | logger -t "$log_prefix"
  exit 1
fi

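if [[ ! -d /var/run/node-exporter ]]; then
  echo "ERROR: /var/run/node-exporter missing; needed for '$job'. Aborting." >&2
  exit 1
fi
lockfile="/var/run/node-exporter/.lock.$job"
# -n: give up at once if another instance holds the lock.
# -E 0: exit 0 in that case, so a skipped overlapping run is not a failure.
flock -n -E 0 "$lockfile" "$script" "$@" 2>&1 | logger -t "$log_prefix"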

# Get results and clean up.
runner_exit=${PIPESTATUS[0]}
finish=$(date +%s)
duration=$(( finish - start ))
cat > "$runner_prom_file".tmp << EOF
# HELP runner_exit Exit code of runner.
# TYPE runner_exit gauge
runner_exit{script="$job"} $runner_exit
# HELP runner_last Time latest run finished.
# TYPE runner_last gauge
runner_last{script="$job"} $finish
# HELP runner_duration Duration of latest run.
# TYPE runner_duration gauge
runner_duration{script="$job"} $duration
EOF
mv "$runner_prom_file".tmp "$runner_prom_file"
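When debugging it can be handy to read a value back out of the metrics file by hand. A small sketch, assuming the file layout written above (one sample per metric, a single script label), is below. The helper prom_value and the job name backup.sh are hypothetical.

```shell
#!/bin/bash
# Sketch: read one metric value from a runner .prom file.
# prom_value is a hypothetical helper; it assumes the layout runner-log
# writes, with one sample line per metric.
prom_value() {
  local file="$1"
  local metric="$2"
  # Skip the "# HELP" and "# TYPE" comment lines, then match a sample such as
  #   runner_exit{script="backup.sh"} 0
  # and print its last field, which is the value.
  awk -v m="$metric" '
    /^#/ { next }
    index($0, m "{") == 1 { print $NF; exit }
  ' "$file"
}

# Example use (hypothetical job named backup.sh):
#   prom_value /var/lib/node_exporter/textfile_collector/runner_backup.prom runner_exit
```

Comparing the runner_last value against date +%s gives a quick idea of how stale a job is, without going through Prometheus.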