Safe and Robust Scheduled Tasks in Rails

I recently added a feature to a SaaS app I am building: a daily summary email sent to each user. It is encapsulated in a rake task that is invoked once a day at 7:30am by the Heroku Scheduler add-on, which provides a simple cron-like interface for free.

In its first incarnation this feature had two critical flaws:

  • If the task failed (processes will occasionally segfault on Heroku) then it would not be retried until the following day (lack of robustness).
  • If the scheduler invoked the task multiple times (a real possibility), then users would be emailed multiple times (lack of idempotence).

In the former case, the worst outcome is that users do not receive their daily summary email that day. As this is a key feature of the system, that matters, but it is not absolutely critical, as they can still log in to see what they missed.

The latter, however, could be critical indeed. Here’s an excerpt from one of Patrick McKenzie’s war stories:

I had several very irate emails from customers, who had just had their morning appointments come in and complain about getting contacted by Appointment Reminder. Repeatedly. See, for the several hours that the queued workers were down, a cron job kept saying “Who has an appointment tomorrow? Millie Smith? Have we called Millie Smith yet? OK then, queuing a call for Millie Smith and ignoring her for 5 minutes while that call takes place.” There are an awful lot of 5 minute intervals in several hours, and the queue was not idempotent, so Millie Smith got many, many calls queued for her.

As soon as I hit “go”, the backed up queue workers blasted through 600 calls, 400 SMSes, and 200 emails, and my website and Twilio received an impromptu stress test. We passed with flying colors. Millie Smith’s phone, on the other hand, did not. The worst affected user got 40 calls, back to back, essentially DDOSing their phone line for 15 minutes straight.

The key, then, is to make our scheduled tasks both robust against intermittent failure (e.g. random segfaults) and idempotent (i.e. for a given input, in this case a date, the task will only perform its actions once).

An additional requirement I’m adding is that it shouldn’t increase the cost of hosting the service; spinning up a worker process to poll constantly for a once-a-day event would naturally be overkill.

The Proposed Solution

To protect against the chance of occasional task failures, the task will be scheduled to run at 7:30am (the desired time) and again at 8:00am and 8:30am, giving us two possible retries per day. If the task fails three times in a row over a 60-minute period, then there is likely something wrong with this part of the system that needs investigating.

To allow for multiple invocations the task will need to check the last time it successfully ran and ensure that this time is at least n hours in the past. I want to allow a four-hour window each morning for manual interventions, just in case, so I will set n to 20.
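As a quick check on that choice of n, the arithmetic can be sketched in plain Ruby. The helper name `safe_to_send?` and the dates are illustrative only and are not part of the app; plain seconds stand in for ActiveSupport's `20.hours` so that the sketch runs outside Rails.

```ruby
# Minimal sketch of the time-window check. safe_to_send? is a hypothetical
# helper; min_interval defaults to n = 20 hours, expressed in seconds.
def safe_to_send?(last_sent_at, now, min_interval: 20 * 60 * 60)
  # No recorded send means this is the first run.
  last_sent_at.nil? || (now - last_sent_at) >= min_interval
end

# Arbitrary example dates, in UTC to keep the arithmetic free of DST shifts.
day1_0730 = Time.utc(2014, 6, 2, 7, 30)
day1_0800 = Time.utc(2014, 6, 2, 8, 0)
day1_0830 = Time.utc(2014, 6, 2, 8, 30)
day2_0330 = Time.utc(2014, 6, 3, 3, 30)
day2_0730 = Time.utc(2014, 6, 3, 7, 30)

first_run    = safe_to_send?(nil, day1_0730)        # nothing recorded yet
retry_run    = safe_to_send?(day1_0730, day1_0800)  # 30 minutes after a success
after_late   = safe_to_send?(day1_0830, day2_0730)  # 23 hours after an 8:30am success
window_opens = safe_to_send?(day1_0730, day2_0330)  # exactly 20 hours later
```

Here `first_run`, `after_late` and `window_opens` are true and `retry_run` is false. Even when the last retry of the day is the one that succeeds, the next morning's 7:30am run is still 23 hours later and so is allowed. After a normal 7:30am send, the guard reopens at 3:30am, which is one way to read the four-hour window for manual intervention.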

To keep things very simple I’ll be using a key-value store, Kue, that is built on ActiveRecord and supports the storage of objects such as a Time instance in our case.

The library hasn’t been updated for some time, but thanks to its simplicity, updating it for use with Rails 4.1 was straightforward. A fork is available here.

Integration

Add to your Gemfile:

gem 'kue', github: "shaflidason/kue"

And run:

$ bundle install
$ rails generate kue:install
$ rake db:migrate

With Kue installed, protecting the method that sends the emails is straightforward:

def self.send_daily_summaries
  # PROTECT: Should only send summaries if the last successful send
  #          was >= 20 hours before now.
  key = :daily_summary_last_sent_at

  # No stored value means this must be the first run, so it is safe to send.
  safe_to_send = !KueStore.exists?(key) ||
                 (Time.now - KueStore[key]) >= 20.hours

  unless safe_to_send
    logger.info "Daily summaries were not sent on this invocation as insufficient time has passed since last send"
    return
  end

  all.each do |account|
    account.users.each do |user|
      Notifier.daily_summary(account, user).deliver
    end
  end

  # Recorded only once every email has been handed off, so a failure
  # part-way through leaves the next scheduled run free to retry.
  KueStore[key] = Time.now
end

The emails will therefore only be sent if either we have no value recorded for the last send time (first run), or 20 hours have elapsed since the time of the last send.
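To see how this behaves across the three scheduled runs, here is a small plain-Ruby simulation. It is only a sketch: a Hash stands in for KueStore, a block stands in for the mailer, and `DailySummaryGuard` and the user names are hypothetical. A raised exception stands in for a crashed process; a real segfault would not be rescued, but the effect on the stored timestamp is the same, as it is never written.

```ruby
# Simulation of the guarded send. All names here are illustrative.
class DailySummaryGuard
  MIN_INTERVAL = 20 * 60 * 60 # 20 hours, in seconds

  def initialize(store = {})
    @store = store # stand-in for the key-value store
  end

  # Yields (to do the sending) only when the window has elapsed, and records
  # the timestamp only if the block returns without raising.
  def run(now)
    last_sent_at = @store[:daily_summary_last_sent_at]
    return :skipped if last_sent_at && (now - last_sent_at) < MIN_INTERVAL

    yield
    @store[:daily_summary_last_sent_at] = now
    :sent
  end
end

guard = DailySummaryGuard.new
sent  = []
t0730 = Time.utc(2014, 6, 2, 7, 30)

# 7:30am: the run dies after emailing one user, so no timestamp is stored.
begin
  guard.run(t0730) do
    sent << "alice"
    raise "simulated crash"
  end
rescue RuntimeError
  # Nothing was recorded, so the 8:00am retry is free to run.
end

r0800 = guard.run(t0730 + 30 * 60) { sent.push("alice", "bob") } # retry goes ahead
r0830 = guard.run(t0730 + 60 * 60) { sent.push("alice", "bob") } # blocked by the guard
```

After these three runs `r0800` is `:sent`, `r0830` is `:skipped`, and `sent` holds "alice", "alice", "bob". The guard ensures a completed send happens only once per window, but a user who was emailed before a mid-run failure receives a second copy on the retry. How much that matters depends on how long the send loop takes and how often it fails part-way.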

Deployment to Production

All that remains is to set up the two additional Heroku Scheduler jobs; the task will then be protected against up to two random failures per day and against multiple mistaken invocations. Note that it is the method itself, and not just the rake task, that performs the safety checks, so mistaken invocations from the console will not send emails when they should not.

Conclusions

It is vitally important to add guards to any action in your system that, if invoked repeatedly, would have unpleasant side effects, such as sending massive volumes of emails (or placing phone calls, in Patrick’s case) and seriously inconveniencing your customers or, worse, your customers’ customers.

The method presented here could be made even more robust (what if there are three failures on a given day, or Heroku has issues for the entire time window?) but any such implementation will require the integration of more complex libraries (which are likely overkill for sending some emails once a day) and would also require the cost of an additional process to run, and do very little, all month. The added complexity would also require further guards to be put in place, to guard against new modes of failure that have then been introduced.

Finally, be sure to write tests to exercise this behaviour; it is too important to be left to chance. Here’s an example using RSpec:

require 'spec_helper'
require 'rake'

describe 'scheduler:send_daily_summaries' do
  PracticeManager::Application.load_tasks

  before {
    create(:account_with_schema)
    create(:account_with_schema)

    @task = Rake::Task['scheduler:send_daily_summaries']
  }

  after {
    # Tasks can only run once per session, so re-enable
    # each time
    @task.reenable

    # Ensure deliveries from past tests are cleared
    ActionMailer::Base.deliveries.clear
  }

  it { expect { @task.invoke }.not_to raise_exception }

  it "sends to all users" do
    @task.invoke

    expect(ActionMailer::Base.deliveries.count).to eql(Subscriptions::User.count)
  end

  it "sends at most once a day" do
    @task.invoke

    expect(ActionMailer::Base.deliveries.count).to eql(Subscriptions::User.count)

    @task.reenable
    @task.invoke

    expect(ActionMailer::Base.deliveries.count).to eql(Subscriptions::User.count)

    # Fast-forward 20 hours; the emails should then send again
    Timecop.travel(Time.now + 20.hours) do
      @task.reenable
      @task.invoke

      expect(ActionMailer::Base.deliveries.count).to eql(Subscriptions::User.count * 2)
    end
  end
end