This document started out as a proposal for mid-commit hooks on the ZODB mailing list, but I realized while writing it that mid-commit hooks were flawed. ZODB is a transactional object-oriented database. It uses a transaction package that provides a two-phase commit abstraction for Python.
Often, when working with transactions, we need to do non-transactional things (NTTs), like sending email (or SMS or push-notification) messages. We typically only want to do these NTTs if the transaction suceeds, and, ideally, we want to abort the transaction if a NTT fails. Amongst the common approaches for this:
Define a transaction-aware data manager that does the NTT. (This was my terrible idea originally and is a good example of over-generalization. I'm sorry.) This typically works by doing the NTT in the first phase of two-phase commit, ideally by arranging that it's processed after transactional data managers. This way, all of the transactional data managers have promised that they'll commit their data, but will abort of the NTT fails.
This approach has some disadvantages:
- It's complicated.
- It doesn't work well if there are multiple NTTs. If one succeeds and the second fails, the transaction is aborted, but there's no way to roll back the NTT that succeeded.
- If the NTT takes a long time, it can make the transaction take a long time to commit, having adverse effects on the databases involved. This is, potentially, a huge problem.
Do NTTs in after-commit hooks.
This has the major disadvantage that database changes are committed ragardless of whether the NTT succeeds. It's very difficult to recover from these failures.
Use a transactional queue. If NTT failures are transient, we can write a description of an NTT to a transactional queue, essentially making it transactional. Then an external process or thread processes the queue, retrying the NTTs until they succeed.
This apprach is the most appealing, but it can be devilishly difficult to get right. We (ZC) have some experience with this, with zc.async, which was so complex that only it's author understood it. Persistent queues seem to draw complexity:
Generalizions to deal with application code to handle queued tasks.
Failure handling
This is probably the most difficult aspect. You need to prevent the same job from being dome multiple times, but you also need to make sure a job is actually done.
Monitoring.
In reaction to this, we started using SQS and queuing jobs in after-commit hooks. This takes a lot of complexity off our sholders and has worked pretty well, however, it has the problem that we sometimes fail to get jobs into SQS due to transient failures. This issue isn't unique to SQS. We'd have the same problem if we used a different queuing system, like Celery.
Perhaps we should pursue a hybrid approach. Use a highly specialized persistent queue to queue jobs to be added to SQS (or some other queueing service). This queue would only contain one kind of job and these jobs would be expected to complete quickly and would time out very quickly. Ideally, a transaction that queues one of these jobs could have an after-commit hook that immediately processes the job and removes it from the queue on success (and unclaims it on failure). These queues should be small and can use a simple data structure, like a BTree bucket with queue times as keys, or a BTree that grows when things get very bad, but shrinks back to a single bucket under normal situations.