Tip #987: Kill a few millions async jobs

When Jonas “Surströmming MkII” Rapp said that “someone should probably get fired”, he was careful to qualify the statement with “asking for a friend”. Sure, we’re here to help, friend or foe.

Jonas

Some integration job happened to touch fields that should not have been touched, and triggered a few async plugins a couple of million times. Async server is short-cutting from sweat, SQL ready to commit suicide, UI taking a nap while her friends are busy staying alive.

The only thing I can think of to fix it in reasonable time is to stop async service, run SQL update on all jobs to set status to cancelled, and then restart async service.

Any ideas or advice for me?

Truck to the rescue

Tipster truckThe replies came thick and fast. Below is the readers digest version. Thanks to Joel, Mitch Milam, Andrii Butenko, and David Jennaway for their throughtful contributions.

Supported option or not, if on premises, then the first thing you’d want to do is to stop async service. Regardless where your instance is, take a backup for a good measure, you might need it if things go pear-shaped.

Supported

For supported approach you’d need to cancel the jobs then delete them. You can cancel in bulk without any code by using a third-party SSIS package. If coding does not scare you then use an Update to change the status of the asyncoperations, and you’d get the best performance by wrapping this in an ExecuteMultiple. To minimise latency, run the code directly on the CRM server, also run multiple instances in parallel (code should need to be able to handle concurrency).

Once all jobs are cancelled, use bulk deletion job to mop up. Don’t sweat writing code for that – bulk delete is already very efficient.

Unsupported

On-premises only, of course. Everyone seems to agree that direct SQL will be by an order of magnitute faster though it probably should be the very last resort. Some say that there are too many variables to use it safely while others were “a bit more gung-ho” as far as canceling the jobs is concerned. But everyone seems to agree that deleting records using SQL is not a good idea.

To determine what script T-SQL should be used to cancel the job, run SQL Profiler and cancel a single job while capturing the trace. This trace should give you a solid starting point for the script.

Again, use bulk deletion job to delete the canceled jobs.

Joel: Using bulk delete to kill asynchronous jobs is like a science fiction story where they build a robot that gets out of hand, so they have to build another robot to kill the first robot. Wonder how Asimov’s three laws apply?

2 thoughts on “Tip #987: Kill a few millions async jobs

  1. Leon Tribe says:

    I once had a graduate see if he could send workflows viral by having each once call itself twice. It worked and the system soon bogged down to treacle with literally thousands of system jobs running.

    Mass delete could not keep up. In the end I removed roles from the user the jobs were running under.

Leave a Reply

Your email address will not be published. Required fields are marked *