Debugging Celery Issues in Django

Lockdown has ended in Melbourne and we’re able to resume mingling, gossiping, and chattering. Just before we could get off our seats and out the door though, Celery (our distributed task queue) jumped the gun and had us all scratching our heads over a recent issue we uncovered at Kogan.

Most Django developers will use Celery at least to manage long-running tasks (that is, longer than what’s reasonable in a web request/response cycle). It’s easy to set up and easy to forget that it’s running, until of course something goes wrong.

We observed that at around midnight, hundreds of gigabytes of data were recorded as ingress to our RabbitMQ node hosted on AWS, wreaking havoc on available memory and CPU utilisation.

We continued to investigate. CloudWatch metrics unveiled that the data mass was originating from our worker autoscale group, narrowing the search down.

Introducing Celery mingling.

Celery keeps a list of all revoked tasks, so when a revoked task comes in from the message queue it can be quickly discarded. Since Celery can be distributed, there’s a feature enabled by default called mingling which enables Celery nodes to share their revoked lists when a new node comes online. On paper, this sounds like a good thing: If a task is revoked then it shouldn’t be executed! Our use case doesn’t involve revoking tasks so it seemed harmless - if by chance we wanted to revoke a task manually we’d be able to, knowing that it will propagate to all nodes.
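For reference, manual revocation is a one-liner against the control API. A minimal sketch, assuming your Celery app lives in a proj.celery module (the task id here is a placeholder):

# Revoke a task by id so no worker will execute it.
# `proj.celery` and the task id are placeholders for illustration.
from proj.celery import app

app.control.revoke("6f0c6a8e-8a2c-4f6e-9c3b-1d2e3f4a5b6c")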

Unfortunately there’s more to the revoked list than meets the eye. Here’s a snippet from Celery:

def maybe_expire(self):
    """If expired, mark the task as revoked."""
    if self._expires:
        now = datetime.now(self._expires.tzinfo)
        if now > self._expires:
            revoked_tasks.add(self.id)
            return True

Adding an expiration to tasks is something we do often, as many of our actions are time-sensitive. If we’ve missed the window, we shouldn’t run the task.
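For example, an expiration can be attached to each call with apply_async. A sketch, where send_price_alert and the SKU are hypothetical:

from datetime import datetime, timedelta, timezone

from celery import shared_task

@shared_task
def send_price_alert(sku):
    ...  # hypothetical time-sensitive work

# `expires` accepts either a number of seconds or an absolute datetime:
send_price_alert.apply_async(args=["SKU-123"], expires=600)
send_price_alert.apply_async(
    args=["SKU-123"],
    expires=datetime.now(timezone.utc) + timedelta(minutes=10),
)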

The above snippet shows that when an expired task comes in it gets added to the revoked list! As a result, when a new Celery node came online our existing workers were eager to share their 250MB lists with the new node. Keep in mind that this list is just a list of UUIDs: we have a lot of tasks executing! We quickly turned off this feature after we observed this behaviour. We also noticed that a lot of workers were restarting at midnight - 250MB multiplied by 30 workers restarting is a lot of handshakes and a lot of data!

Looking through the supervisor logs to find the cause of the restarts initially turned up a red herring: processes were exiting with exit code 0. Surprisingly, Celery will also exit 0 on an unrecoverable exception, so we started looking through Sentry for anything suspicious. We uncovered this exception:

TypeError: __init__() missing 1 required positional argument: 'url'

The rest of the trace was unhelpful due to the coroutine nature of its source. At a glance the exception appears to be a bug on our end, but looking at the source reveals it to be a pickle deserialization error.

Ultimately, we found the issue was not an unpicklable class, but an unpicklable exception being passed to the retry mechanism. We filed an issue and removed all custom exceptions from the retry method.
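The failure mode is easy to reproduce outside Celery: any exception whose __init__ takes a required argument that isn’t forwarded to the base class won’t survive a pickle round trip. A minimal sketch, with a hypothetical FetchError:

import pickle

class FetchError(Exception):
    """Hypothetical custom exception with a required argument."""
    def __init__(self, url):
        super().__init__()  # leaves self.args empty, which breaks unpickling
        self.url = url

data = pickle.dumps(FetchError("https://example.com"))  # pickling succeeds
pickle.loads(data)
# TypeError: __init__() missing 1 required positional argument: 'url'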

If you’re running Celery and lean heavily on task expiration, we recommend turning off mingling. We’d also recommend not passing custom exceptions into Celery’s retry mechanism, and instead logging exceptions where they’re initially raised.
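Mingling (and the similarly chatty gossip) can be disabled per worker with CLI flags. A sketch, assuming your app lives in proj.celery:

# Equivalent to: celery -A proj worker --without-mingle --without-gossip
from proj.celery import app

app.worker_main(["worker", "--without-mingle", "--without-gossip"])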

Django Test Splitting on Circle CI

One of the most important things you can do for your development teams' productivity is to shorten the feedback loop during development. This applies to getting feedback from customers or stakeholders to ensure you're building the right thing, as much as it does to testing the code you're writing to ensure no bugs have crept into your change set.

Today we're focusing on the Development - Test feedback loop. Developing a change and running your regression test suite to validate that change should be as fast as possible.

We've used three different systems to run our unit tests over the last few years: beginning with a self-hosted Jenkins instance, transitioning to Travis CI, and finally arriving at Circle CI.

Monitoring Celery Queue Length with RabbitMQ

Earlier this week Matthew Schinckel wrote a post about how he monitors Celery queue sizes with a Redis backend.

RabbitMQ (https://www.rabbitmq.com/) is also a popular backend for Celery, and it took us a long time to get good visibility into our task queues and backlogs. So we'd like to share our solution for monitoring RabbitMQ Celery queues for others that might be in a similar situation.
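As a taste of what's possible, queue depths are exposed over the RabbitMQ management plugin's HTTP API. A minimal sketch (not necessarily the solution the post goes on to describe; the host, port, and credentials are assumptions):

import requests

def queue_length(queue, host="localhost", user="guest", password="guest"):
    # The default vhost "/" is URL-encoded as %2F in the management API path.
    url = f"http://{host}:15672/api/queues/%2F/{queue}"
    response = requests.get(url, auth=(user, password))
    response.raise_for_status()
    return response.json()["messages"]

print(queue_length("celery"))  # the default Celery queue is named "celery"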

Management Command Aliases

A quick minor efficiency tip for today that'll make your life at the Django command line incrementally better.

You're using django-extensions in your project, right? If not, go and rectify that problem immediately.

The best feature™ that django-extensions provides is the shell_plus command. You know the worst thing about `shell_plus`? The name. That's a long name to type out into your shell, especially since it's snake_case, and I type the command approximately 15,000 times per day. Wouldn't it be nice to alias `shell_plus` to something like `sh`? Well, now you can!
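One way to get there (a sketch, not necessarily the mechanism this post describes): Django discovers commands in any installed app's management/commands package, so a tiny subclass acts as an alias. Drop this into yourapp/management/commands/sh.py (a hypothetical location) and `manage.py sh` works:

# yourapp/management/commands/sh.py -- hypothetical location
from django_extensions.management.commands.shell_plus import Command as ShellPlus

class Command(ShellPlus):
    """Alias so `manage.py sh` behaves exactly like `manage.py shell_plus`."""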

Django on Kubernetes

This is part 2 of our Kubernetes hackday series. Part 1 goes over how we spent the day and what the goals and motivations are.


For part 2, we're going to delve into the architecture required for running a Django application on Kubernetes, as well as some of the tooling we used to assist us.

This post will assume some familiarity with deploying and operating a production web application. I'm not going to spend much time going over the terminology that Kubernetes uses either. What I hope to do in this post is present enough information to kickstart your own migration. Kubernetes is big - and knowing what to research now and what to put off until later is really tricky.

I'm also going to describe this deployment in terms of Amazon EKS, simply because I was on the EKS team and most of our applications are already on AWS.

Faster Django Tests by Disabling Signals

Django signals are a form of inversion of control that developers can use to trigger behaviour based (usually) on updates to models.

The canonical example is automatically creating a UserProfile when a User instance is created:

from django.conf import settings
from django.db.models.signals import post_save
from django.dispatch import receiver

from .models import UserProfile  # wherever your profile model lives

@receiver(post_save, sender=settings.AUTH_USER_MODEL)
def post_save_receiver(sender, instance, created, **kwargs):
    if created:
        UserProfile.objects.create(user=instance)

Signals are useful for connecting your own behaviour to events that you might not have control over, such as those generated by the framework or library code. In general, you should prefer other approaches, like overriding a model's save method, when you have the ability, as it makes code easier to reason about.

Signal receivers can sometimes be much more complex than creating a single object, and there can be many receivers for any given event, which multiply the time it takes to perform simple actions.

Tests and signals

Often you won't actually need your signals to execute when you're running your test suite, especially if you're creating and deleting many thousands of model instances. Disconnecting signals is tricky though, especially if the disconnect/reconnect logic can be stacked.
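For illustration, manual disconnection looks something like this (a sketch; the model and receiver imports are hypothetical). Forget the reconnect, or stack two of these blocks, and things get confusing fast:

from django.db.models.signals import post_save

from myapp.models import MyModel  # hypothetical model
from myapp.receivers import mymodel_post_save  # hypothetical receiver

post_save.disconnect(mymodel_post_save, sender=MyModel)
try:
    MyModel.objects.create()  # the receiver won't fire here
finally:
    post_save.connect(mymodel_post_save, sender=MyModel)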

An easier, but more invasive method for suspending signal receivers is to check a global setting, and return early.

from django.conf import settings
from django.db.models.signals import post_save
from django.dispatch import receiver

from .models import MyModel

@receiver(post_save, sender=MyModel)
def mymodel_post_save(sender, **kwargs):
    if settings.DEBUG:
        return  # skip the receiver entirely
    work()  # the receiver's real work

This has two drawbacks. Firstly, it's messy, and you need to remember to add the check to each receiver. Secondly, it depends on a specific setting that you might need to have turned off when running your tests, which makes it harder to test the actual receiver.

Selectively disabling signals

We can take this most recent idea of checking a setting, and wrap it up nicely in our own decorator. When running your tests, you can override the SUSPEND_SIGNALS setting per test method or class.

Here's a gist that does just that:

import functools

from django.conf import settings
from django.dispatch import receiver

def suspendingreceiver(signal, **decorator_kwargs):
    """Like @receiver, but the receiver is a no-op when SUSPEND_SIGNALS is True."""
    def our_wrapper(func):
        @receiver(signal, **decorator_kwargs)
        @functools.wraps(func)
        def fake_receiver(sender, **kwargs):
            # Default to False so production doesn't need the setting defined.
            if getattr(settings, "SUSPEND_SIGNALS", False):
                return
            return func(sender, **kwargs)
        return fake_receiver
    return our_wrapper

And using this decorator in your test suite is straightforward:

from django.db.models.signals import post_save
from django.test import TestCase, override_settings

from .models import MyModel

@suspendingreceiver(post_save, sender=MyModel)
def mymodel_post_save(sender, **kwargs):
    work()

@override_settings(SUSPEND_SIGNALS=True)
class MyTestCase(TestCase):
    def test_method(self):
        MyModel.objects.create()  # mymodel_post_save won't execute

And just like that, we can skip signal receivers in our test suite as we like!