Let's Talk About Cron

Lying awake at night, you bolt up from an unsettling dream, as if there is something wrong in the universe and your subconscious has realized it. Perhaps the masters of time are out of balance, or the dictators of our automated universe are making their needs known. "Oh no," you wonder. "Could it be... cron?"

Whether it's the start of a nightmare or the beginning of a day in reality, automating our workflows is something that we as developers do every day. As with brushing our teeth, we don't think much about it: we probably copy paste from an instruction manual or a previous script, and get on with our day. But what if our behavior has led to an imbalance? In other words (and this is what I woke up pondering this morning), what kind of collective decisions are we making with respect to cron?

For those who aren't familiar, GitHub workflows can run on a schedule, meaning we can use cron syntax to say "Run this once a week, on Sunday, at 2:00am" or pretty much any interval that you'd like.
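To make that concrete, here is a minimal Python sketch of how the five fields of a cron expression line up. The expression "0 2 * * 0" is my own rendering of the "once a week, on Sunday, at 2:00am" example, and the helper name split_cron and the field labels are hypothetical names chosen for illustration. It only splits the string; it does not validate or expand ranges, steps, or lists.

```python
# Labels for the five fields of a standard cron expression, in order.
# These label names are illustrative, not part of any library.
CRON_FIELDS = ("minute", "hour", "day_of_month", "month", "day_of_week")


def split_cron(expression):
    """Split a five-field cron expression into a dict of labeled fields."""
    parts = expression.split()
    if len(parts) != len(CRON_FIELDS):
        raise ValueError(f"expected 5 fields, got {len(parts)}: {expression!r}")
    return dict(zip(CRON_FIELDS, parts))


# "Once a week, on Sunday, at 2:00am": minute 0, hour 2, day of week 0 (Sunday).
weekly = split_cron("0 2 * * 0")
for field, value in weekly.items():
    print(f"{field}: {value}")
```

The "*" in the day-of-month and month positions means "any", which is why this reads as weekly rather than as one specific date.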

The problem arises because (I suspect) people are uncreative with their choices of time. Even worse, they probably just copy paste from the GitHub (or other) instructions. To test my feeling of dread that we are making sloppy decisions about our cron, I did a small analysis. I created a simple project to query for a sample of 1000 GitHub workflow files that use cron, and then took a look at some details. This post, which lives alongside the repository, describes what I found.

At what frequency?

For approximately 1000 GitHub workflow files with scheduled events, I looked at the frequency of the schedule. By far, the most common frequencies are once a week, and once a day (regardless of time).

The "4 weeks 3 days" label is what we would call a month (31 days), and "52 weeks 1 day" is a year (365 days). I figured out these frequencies by first generating two timestamps for a cron entry, calculating the difference between the two in seconds, and then converting the seconds into weeks, days, hours, and minutes. I'm really interested to know what jobs are being run every minute, or even every 15 or 30 minutes.
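The conversion step can be sketched in a few lines of Python. This is not the code from the repository, just a rough illustration of the idea: the function name describe_interval is hypothetical, and the two timestamps are typed in by hand rather than generated from a cron entry (which would need a cron parsing library).

```python
from datetime import datetime

# Unit sizes in seconds, largest first.
UNITS = (
    ("week", 7 * 24 * 3600),
    ("day", 24 * 3600),
    ("hour", 3600),
    ("minute", 60),
)


def describe_interval(first, second):
    """Describe the gap between two run times in weeks, days, hours, minutes."""
    seconds = int((second - first).total_seconds())
    parts = []
    for name, size in UNITS:
        count, seconds = divmod(seconds, size)
        if count:
            parts.append(f"{count} {name}{'s' if count != 1 else ''}")
    return " ".join(parts)


# Two consecutive runs of a hypothetical monthly job across a 31-day month.
print(describe_interval(datetime(2021, 1, 1), datetime(2021, 2, 1)))
# Two consecutive runs of a hypothetical yearly job across a 365-day year.
print(describe_interval(datetime(2021, 1, 1), datetime(2022, 1, 1)))
```

A caveat of this approach is that the answer depends on which two timestamps you pick: a monthly job spanning February would come out as exactly 4 weeks, not 4 weeks 3 days.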

At what time of day?

Surely once a day or week isn't bad as long as we space the running times out into equal buckets, right? Unfortunately, if we look at the timestamps, we see that most people are using the easiest example to look up, 00:00 (which they might assume is their own midnight), followed by a second peak at 7:00 UTC and a third at 12:00 UTC. Did I mention these times are in UTC? I wonder how many people look that up? Surely I just assume that I'm the center of the universe and GitHub can magically derive my timezone from my commits and... yeah, probably not. :)

If I had to guess, I'd say the first group copy pasted an example from crontab.guru without considering that the times are in UTC (00:00 UTC is 5:00pm in San Francisco, which I mention since a lot of tech is there), the second group might have actually considered UTC and chosen the correct midnight (7:00 UTC is 9:00am in Germany, midnight in San Francisco), and the third group likely thought about it too (12:00 UTC is 5:00am in San Francisco, 10:00pm in Australia). These conversions assume daylight saving time in San Francisco and Germany; in winter each shifts by an hour. It's hard to say, but probably most people setting up actions don't consider converting from UTC, or maybe don't care. What this chart doesn't do a good job of showing are the times that are never (or almost never) scheduled. They wouldn't even appear as bars.
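If you want to check conversions like these yourself, here is a small Python sketch. The fixed offsets are my assumption: UTC-7 for San Francisco (PDT), UTC+2 for Germany (CEST), and UTC+10 for eastern Australia (AEST). A real tool would use proper timezone data so that daylight saving changes are handled for you. The helper name local_times and the date are arbitrary.

```python
from datetime import datetime, timedelta, timezone

# Assumed fixed offsets (summer time for San Francisco and Germany).
OFFSETS = {
    "San Francisco (PDT)": timezone(timedelta(hours=-7)),
    "Germany (CEST)": timezone(timedelta(hours=2)),
    "Eastern Australia (AEST)": timezone(timedelta(hours=10)),
}


def local_times(hour_utc):
    """Return the local wall-clock time in each place for a given UTC hour."""
    # The date is arbitrary; with fixed offsets only the hour matters.
    moment = datetime(2021, 7, 1, hour_utc, 0, tzinfo=timezone.utc)
    return {
        place: moment.astimezone(tz).strftime("%H:%M")
        for place, tz in OFFSETS.items()
    }


# The three peaks in the chart: 00:00, 07:00 and 12:00 UTC.
for hour in (0, 7, 12):
    print(f"{hour:02d}:00 UTC ->", local_times(hour))
```

Running it shows why 7:00 UTC looks like a deliberate choice: it is the only one of the three peaks that lands on local midnight for any of these places.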

What day of the week?

Before I looked at the data, my guess would have been that people want to run jobs on weekend days, since that's when people aren't working. But I was surprised to see that the most common day is "every day," followed by Monday and then Sunday.

At least every day makes a lot of sense, and I feel silly for not realizing that I run a good chunk of my own jobs on a nightly basis. Sunday I understand, but Monday? Is this another copy pasta artifact or is Monday really ideal? And if so, why?

What about edge cases?

To be fair, to do this analysis I converted cron schedule strings into timestamps and descriptions, and did more parsing. If you want to see the raw data you can see it in the data folder, and here I'll show you raw descriptions. It's actually much easier to read directly from the data file if you really are interested.

Yeah I know it's kind of small, it's getting really late and I don't have the energy left to do a different kind of plot. Please use the data and make a better one and submit a pull request! If you are truly lazy and don't want to click, you can carefully mouse over the strings and at least see a glimpse of values: everything from "At 12 minutes past the hour, every 6 hours" to "At 03:11 every day." It kind of makes me wonder if people thought about these scheduled events or just entered numbers at random.

What did I learn?

From this very brief, just-a-few-hours-before-bed-in-my-pajamas analysis, I sense that:

At face value, it seems like we are being a little sloppy, at least in this sample (the 1000 most recently indexed files with a cron: schedule entry). Three questions come to mind. First, how should GitHub handle many more requests to run jobs on specific days of the week or at specific times? Second, have we even validated that the jobs are actually run at those times? Third, couldn't there just be an option to say "run for me once a (X)" and then let GitHub choose the day/month/hour, taking into consideration all the other jobs? Or maybe GitHub is such a monster now that it's backed by Microsoft that there are infinite resources for running things. I would suspect cron jobs are just a tiny fraction of all the jobs they run, which are going at all times. But there are so many more questions we could ask!

I'm particularly interested in the second point, or at least this feels like one I could tackle on my own. The first is interesting but would be a project for another day, and the third I'm not sure I could answer. So what do you think? Can we be more responsible with our cron-jobbing, or is this a nightmare that is likely to continue and will continue to keep a dinosaur like myself awake at night to wonder... "Something is not right. Could it be Cron?" 👻