Monitoring cron jobs

Cron jobs sometimes fail, and the old way of getting emails from the cron daemon doesn’t really scale. For instance, you might have a job that fails from time to time and that’s OK, but if it keeps failing for too long, it’s a problem. Generally, email as an alerting tool is a Bad Thing and should be avoided.

Since I have Prometheus set up for everything, the easiest thing is to use the textfile collector from node_exporter to dump some basic stats. And since I don’t want to write this over and over, I wrap most cron jobs in the script at the end of this post.

For most cron jobs I want the following things: the exit code, when it last ran, how long it ran for, and a guarantee that only one instance runs at a time. For the first three I export metrics to Prometheus; for the last one I use flock.
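
To make the locking part concrete, here is a minimal sketch of a single-instance wrapper around flock. It assumes the util-linux flock; the helper name run_once and the exit code 75 are illustrative choices and not part of the runner script.

```shell
#!/bin/bash
# Sketch: allow only one instance of a command at a time.
# run_once and the exit code 75 are illustrative choices, not from the post.
run_once() {
  local lockfile="$1"
  shift
  # -n    : give up immediately if another process holds the lock
  # -E 75 : exit code to return in that case (flock's default is 1)
  flock -n -E 75 "$lockfile" "$@"
}

# Usage (hypothetical job name):
#   run_once /var/run/node-exporter/.lock.backup.sh /etc/cron.d/scripts/backup.sh
```

The runner at the end of this post passes -E 0 instead, so a run that is skipped because of the lock is reported as a success. A distinct code like the one above is only useful if you want to tell skipped runs apart from real ones.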

Lastly, when things go wrong there are two things I want to do: rerun a failed job after I’ve fixed something, and see why the job failed. For the latter, I pipe all output from the script to logger, which tags each log line with a given prefix to make it easier to find.
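
The logging idea can be sketched on its own as follows. The names run_logged and LOG_CMD are hypothetical; LOG_CMD exists only so the sink can be swapped out when trying this on a machine without a syslog daemon.

```shell
#!/bin/bash
# Sketch: send a command's output to syslog while keeping its exit code.
# run_logged and LOG_CMD are hypothetical names, not part of the runner.
run_logged() {
  local tag="$1"
  shift
  # Merge stderr into stdout so both end up in the log under the same tag.
  "$@" 2>&1 | "${LOG_CMD:-logger}" -t "$tag"
  # $? would be the sink's status; PIPESTATUS[0] is the wrapped command's.
  return "${PIPESTATUS[0]}"
}

# Usage (hypothetical job and tag):
#   run_logged cron-backup /etc/cron.d/scripts/backup.sh
#   grep cron-backup /var/log/syslog
```

The point of PIPESTATUS is that the exit code of a pipeline is normally that of its last command, which here would be logger and would almost always be 0.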

For both of these scenarios the runner-log script generates two scripts: /tmp/rerun-$job and /tmp/check-$job. In terms of security it would probably be better to write these somewhere other than /tmp, but this works for now. Referencing those scripts in the alert makes debugging a bit faster.
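
As an illustration of what such an alert could look like, the sketch below writes a Prometheus rules file from a heredoc. The alert names, thresholds, severity label and the location of the rules file are all assumptions; only the metric names and the /tmp script paths come from the runner.

```shell
#!/bin/bash
# Sketch: write Prometheus alerting rules for the runner_* metrics.
# Alert names, thresholds and labels are assumptions; adjust per job.
write_runner_alerts() {
  local rules_file="$1"
  # Quoted 'EOF' keeps the shell from expanding $labels in the templates.
  cat > "$rules_file" << 'EOF'
groups:
  - name: cron-runner
    rules:
      - alert: CronJobFailing
        expr: runner_exit != 0
        for: 2h
        labels:
          severity: warning
        annotations:
          summary: "Cron job {{ $labels.script }} keeps failing"
          description: "Logs: /tmp/check-{{ $labels.script }} Rerun: /tmp/rerun-{{ $labels.script }}"
      - alert: CronJobStale
        expr: time() - runner_last > 86400
        labels:
          severity: warning
        annotations:
          summary: "Cron job {{ $labels.script }} has not finished a run in 24h"
EOF
}

# Usage (hypothetical path):
#   write_runner_alerts /etc/prometheus/rules/cron-runner.yml
```

Because runner_exit keeps its value until the next run, the "for" duration probably needs to be longer than the job's interval if a single failed run should be tolerated.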

I put runner-log and all the jobs it calls in /etc/cron.d/scripts. It just makes it easier to find “all things cron” inside of /etc/cron.d rather than scattered around the file system in possibly “more correct” locations.
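
For reference, this is roughly how a job gets wired up with that layout. The sketch writes a cron.d entry from a heredoc; the job name backup.sh, the log prefix cron-backup, the --quick argument, the schedule and the target file name are all hypothetical. The argument order (job, log prefix, then the job's own arguments) follows the runner script.

```shell
#!/bin/bash
# Sketch: write a cron.d entry that runs a job through runner-log.
# Job name, log prefix, schedule and target file are hypothetical; only the
# runner-log path comes from the post.
write_cron_entry() {
  local target="$1"
  cat > "$target" << 'EOF'
# m    h  dom mon dow  user  command
*/15   *  *   *   *    root  /etc/cron.d/scripts/runner-log backup.sh cron-backup --quick
EOF
}

# Usage (hypothetical file name):
#   write_cron_entry /etc/cron.d/backup
```

With this entry the runner would execute /etc/cron.d/scripts/backup.sh --quick, tag its output with cron-backup, and write its metrics to runner_backup.prom.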

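One last note: I use batch in the rerun script, and in more and more places as time goes on, because it solves a number of issues. It queues the command with atd, which runs it outside my interactive shell once the system load allows.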

#!/bin/bash

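start=$(date +%s)
# Jobs live in the same directory as this runner.
script_dir=$(dirname "$0")
job="$1"
log_prefix="$2"
shift; shift
if [[ -z "$job" || -z "$log_prefix" ]]; then
  echo "ERROR: Missing arguments. Usage: $0 <job> <log-prefix> [job args...]" >&2
  exit 1
fi
script="$script_dir/$job"
# One .prom file per job, named after the job without its file extension.
runner_prom_file="/var/lib/node_exporter/textfile_collector/runner_${job%.*}.prom"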

# Record how to check and/or rerun this script.
# Note that if the script takes complex arguments this won't work. $* and $@
# are for all purposes indistinguishable in a heredoc.
cat > "/tmp/rerun-$job" << EOF
#!/bin/bash

echo "/etc/cron.d/scripts/runner-log $job $log_prefix $*" | sudo -u $(whoami) batch
EOF
cat > "/tmp/check-$job" << EOF
#!/bin/bash

grep "$log_prefix" /var/log/syslog \
  || grep "$log_prefix" /var/log/syslog.1
EOF
chmod 755 "/tmp/rerun-$job" "/tmp/check-$job"

# Run the script.
if [[ ! -x "$script" || -d "$script" ]]; then
  echo "ERROR: Can't find script for '$job'. Aborting." \
    | logger -t "$log_prefix"
  exit 1
fi

if [[ ! -d /var/run/node-exporter ]]; then
  echo "ERROR: /var/run/node-exporter missing; needed for '$job'. Aborting." \
    | logger -t "$log_prefix"
  exit 1
fi
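lockfile="/var/run/node-exporter/.lock.$job"
# -n: don't wait if another instance holds the lock; -E 0: exit 0 in that case,
# so a skipped run is recorded as a success.
flock -n -E 0 "$lockfile" "$script" "$@" 2>&1 | logger -t "$log_prefix"

# Get results and clean up.
# PIPESTATUS[0] is the exit code of flock (i.e. of the job), not of logger.
runner_exit=${PIPESTATUS[0]}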
finish=$(date +%s)
duration=$(( finish - start ))
cat > "$runner_prom_file".tmp << EOF
# HELP runner_exit Exit code of runner.
# TYPE runner_exit gauge
runner_exit{script="$job"} $runner_exit
# HELP runner_last Unix timestamp at which the latest run finished.
# TYPE runner_last gauge
runner_last{script="$job"} $finish
# HELP runner_duration Duration of latest run in seconds.
# TYPE runner_duration gauge
runner_duration{script="$job"} $duration
EOF
mv "$runner_prom_file".tmp "$runner_prom_file"