Cron jobs sometimes fail, and the old way of getting emails
from the cron daemon doesn’t really scale. For instance, you might
have a job that fails from time to time and that’s OK, but if it fails
for too long it’s a problem. Generally, email as an alerting
tool is a Bad Thing and should be avoided.
Since I have Prometheus set up for everything, the easiest thing
is to use the textfile collector from node_exporter to dump
some basic stats. And since I don’t want to write this over and over,
I wrap most cron jobs in the script at the end of this post.
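As a minimal sketch of what "dumping stats" through the textfile collector looks like: a job writes a small file in the Prometheus text format into the directory that node_exporter's `--collector.textfile.directory` flag points at. The directory, the file name, the metric name, and the `write_metric` helper below are all hypothetical and only there for illustration.

```shell
#!/bin/bash
# Hypothetical directory for the demo; in production this would be whatever
# --collector.textfile.directory points at.
textfile_dir="${TEXTFILE_DIR:-$(mktemp -d)}"
prom_file="$textfile_dir/demo_job.prom"

write_metric() {
  # $1 = metric name, $2 = value, $3 = help text
  cat > "$prom_file.tmp" << EOF
# HELP $1 $3
# TYPE $1 gauge
$1 $2
EOF
  # A rename within one filesystem is atomic, so the collector never
  # reads a half-written file.
  mv "$prom_file.tmp" "$prom_file"
}

write_metric demo_last_run_seconds "$(date +%s)" "Time the demo job last ran."
cat "$prom_file"
```

Writing to a `.tmp` file first and renaming it is the same trick the wrapper script in this post uses for its own metrics file.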
For most cron jobs I want the following things: the exit code, when
it last ran, and how long it ran for. I also want only one instance of
it to run at a time. For the first three I export metrics to
Prometheus; for the last one I use flock.
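To make the "only one instance" part concrete, here is a small sketch of how a non-blocking flock behaves when the lock is already held. The lock file is a throwaway temp file and the `sleep` and `echo` commands stand in for real jobs; both are assumptions for the demo.

```shell
#!/bin/bash
lockfile="$(mktemp)"   # hypothetical lock file for the demo

# First instance: holds the lock for two seconds.
flock -n "$lockfile" sleep 2 &
sleep 0.5   # give the background instance time to take the lock

# Second instance: -n makes flock give up immediately instead of waiting.
flock -n "$lockfile" echo "second instance ran"
second_status=$?

wait   # let the first instance finish and release the lock
flock -n "$lockfile" echo "third instance ran"
third_status=$?

echo "second=$second_status third=$third_status"
```

By default a refused lock makes flock exit non-zero. The wrapper in this post passes `-E 0`, so a run that is skipped because the previous one is still going is reported as a success rather than a failure.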
Lastly, when things go wrong there are some things I want to do. I
want to be able to rerun a failed job after I’ve fixed something,
and I want to see why the job failed. For the latter I pipe
all output from the script to logger, which prefixes each log
line with a given bit of text to make it easier to find.
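One thing to watch when piping a job into logger is that `$?` then reports the status of the last command in the pipeline, not the job. The sketch below shows one way to keep the job's exit code using bash's `PIPESTATUS` array. `fake_job` and `tag_lines` are hypothetical stand-ins (for a cron job and for `logger -t`) so the example runs without a syslog daemon.

```shell
#!/bin/bash
# Hypothetical stand-in for a cron job: prints to stdout and stderr, then fails.
fake_job() {
  echo "starting"
  echo "something broke" >&2
  return 3
}

# Stand-in for `logger -t "$log_prefix"`: prefixes every line with a tag.
tag_lines() {
  sed "s/^/$1: /"
}

logfile="$(mktemp)"
fake_job 2>&1 | tag_lines demo-job > "$logfile"
statuses=("${PIPESTATUS[@]}")   # read immediately; the next command overwrites it
job_status=${statuses[0]}       # exit code of fake_job
pipe_status=${statuses[1]}      # exit code of tag_lines, which is what $? would show

cat "$logfile"
echo "job=$job_status pipeline=$pipe_status"
```

With a prefix on every line, finding the output later is a single `grep` for that prefix in the log.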
For both of these scenarios the runner-log script will generate
/tmp/rerun-$job and /tmp/check-$job. In terms of
security it would probably be better if these were written somewhere
else, but this works for now. When writing the alert, putting
references to those scripts in it makes debugging a bit faster.
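The helper scripts are generated with heredocs, and it helps to see what that does to arguments. The following is a simplified sketch, not the real runner-log: the job name, the temp directory, and the arguments are made up, and the generated script only echoes what it would rerun.

```shell
#!/bin/bash
job="demo"                 # hypothetical job name
outdir="$(mktemp -d)"      # the post writes to /tmp directly
set -- "one arg" two       # pretend the wrapper was called with two arguments

# Unquoted EOF: $job and $* are expanded now, while the file is written,
# so the generated script contains the literal values.
cat > "$outdir/rerun-$job" << EOF
#!/bin/bash
echo "would rerun: $job $*"
EOF
chmod 755 "$outdir/rerun-$job"

generated=$("$outdir/rerun-$job")
echo "$generated"
```

The generated script no longer knows that "one arg" was a single argument, which is why this approach only works for jobs with simple arguments.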
I keep runner-log and all the jobs it calls in /etc/cron.d/scripts.
It just makes it easier to find “all things cron” inside of
/etc/cron.d rather than scattered around the file system in
possibly “more correct” locations.
One last note is that I use batch in the rerun script, and in
more places as time goes on, because it solves a number of issues.
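#!/bin/bash
# Usage: runner-log <job> <log_prefix> [job arguments...]
# Lines marked "assumed" were missing from this listing and are reconstructed.
job="$1"                                    # assumed
log_prefix="$2"                             # assumed
if [[ -z "$job" || -z "$log_prefix" ]]; then
  echo "ERROR: Missing arguments."
  exit 1
fi
shift 2                                     # assumed; leaves only the job's arguments
script="/etc/cron.d/scripts/$job"           # assumed location
lockfile="/var/lock/runner-$job.lock"       # assumed location
runner_prom_file="/var/run/node-exporter/runner-$job.prom"  # assumed file name

# Record how to check and/or rerun this script.
# Note that if the script takes complex arguments this won't work. $* and $@
# are for all purposes indistinguishable in a heredoc.
cat > "/tmp/rerun-$job" << EOF
#!/bin/bash
echo "/etc/cron.d/scripts/runner-log $job $log_prefix $*" | sudo -u $(whoami) batch
EOF
cat > "/tmp/check-$job" << EOF
#!/bin/bash
grep "$log_prefix" /var/log/syslog \
  || grep "$log_prefix" /var/log/syslog.1
EOF
chmod 755 "/tmp/rerun-$job" "/tmp/check-$job"

# Run the script.
if [[ ! -x "$script" || -d "$script" ]]; then
  echo "ERROR: Can't find script for '$job'. Aborting." \
    | logger -t "$log_prefix"
  exit 1
fi
if [[ ! -d /var/run/node-exporter ]]; then
  echo "ERROR: /var/run/node-exporter missing; needed for '$job'. Aborting."
  exit 1
fi
start=$(date +%s)                           # assumed
flock -n -E 0 "$lockfile" "$script" "$@" 2>&1 | logger -t "$log_prefix"
exit_code=${PIPESTATUS[0]}                  # assumed; status of flock and the job, not logger

# Get results and clean up.
finish=$(date +%s)                          # assumed
duration=$(( finish - start ))
cat > "$runner_prom_file".tmp << EOF
# HELP runner_exit Exit code of runner.
# TYPE runner_exit gauge
runner_exit{script="$job"} $exit_code
# HELP runner_last Time latest run finished.
# TYPE runner_last gauge
runner_last{script="$job"} $finish
# HELP runner_duration Duration of latest run.
# TYPE runner_duration gauge
runner_duration{script="$job"} $duration
EOF
mv "$runner_prom_file".tmp "$runner_prom_file"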