Packing With MySQL

I’ve been working on packing a 6.5GB production database based on RelStorage and MySQL 5.0.  It contains 6.4 million objects, meaning “select count(1) from current_object” returns 6426879.  I would call this a moderately sized database.  When I first approached it, it had not been packed for at least 3 months.  At first, I could not pack it, because RelStorage was using statements that MySQL mis-optimizes.  I will discuss how I fixed the problem.

The first time I tried to pack this database, the pre_pack phase ran for several hours before I stopped it by killing the Python process.  I studied the problem using “show processlist” and the MySQL docs online, and found a statement in RelStorage that MySQL mis-optimizes.  I tried again and it still took too long; this time I discovered that a workaround I had applied earlier, for a different mis-optimized query, did not actually help on this instance of MySQL.  I decided to make the workaround more robust by using an explicit temporary table.
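The general shape of that workaround, sketched here with illustrative table and column names rather than the actual RelStorage schema, is to materialize the inner query’s result in an explicit temporary table and join against it, instead of letting MySQL re-evaluate a nested subquery:

```sql
-- A nested subquery that MySQL 5.0 may re-run for every outer row:
--   SELECT zoid FROM current_object
--   WHERE zoid NOT IN (SELECT zoid FROM pack_keep);

-- Workaround: materialize the inner result first, then join.
CREATE TEMPORARY TABLE temp_keep (
    zoid BIGINT NOT NULL PRIMARY KEY
);
INSERT INTO temp_keep
    SELECT zoid FROM pack_keep;

SELECT co.zoid
FROM current_object co
    LEFT JOIN temp_keep tk ON co.zoid = tk.zoid
WHERE tk.zoid IS NULL;

DROP TEMPORARY TABLE temp_keep;
```

The primary key on the temporary table gives the join an index to work with, which is the point of the exercise.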

Then, on Saturday, I started a pack again.  The pre_pack phase completed in less than an hour this time and the final phase began.  The final phase receives a complete list of objects and states to delete from the pre_pack phase; all it really has to do is delete them.  I let it run all weekend.  Today (Monday), using “show processlist” again, I measured 17 seconds per transaction, with 26,000 transactions to go.  With a 50% duty cycle, this would have taken about 10 days!  (The 50% duty cycle is a configurable parameter that causes RelStorage to periodically sleep during packing, allowing other clients to write to the database without significant interruption.)
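The 10-day estimate is quick arithmetic on the measured figures, doubled to account for the 50% duty cycle:

```python
# Rough estimate of remaining pack time from the measured figures above.
seconds_per_txn = 17      # measured via "show processlist"
remaining_txns = 26_000   # transactions left to pack
duty_cycle = 0.5          # RelStorage sleeps half the time while packing

busy_seconds = seconds_per_txn * remaining_txns  # 442,000 s of actual work
wall_seconds = busy_seconds / duty_cycle         # doubled by the sleep periods
days = wall_seconds / 86_400                     # 86,400 seconds per day
print(round(days, 1))  # → 10.2
```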

I read up on the DELETE statement.  It turns out that a DELETE with a subquery is often very expensive in MySQL.  I converted the delete statements to use a table join instead of a subquery, and the time per transaction dropped from 17 seconds to about 3 milliseconds.  This time the pack finished in minutes.
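The rewrite relies on MySQL’s multi-table DELETE syntax; the table and column names below are illustrative stand-ins for the actual schema:

```sql
-- Slow in MySQL 5.0: the subquery may be re-evaluated per candidate row.
--   DELETE FROM object_state
--   WHERE (zoid, tid) IN (SELECT zoid, tid FROM pack_state);

-- Fast, but MySQL-specific: a multi-table DELETE using a join.
DELETE object_state
FROM object_state
    INNER JOIN pack_state
        ON object_state.zoid = pack_state.zoid
       AND object_state.tid = pack_state.tid;
```

The join form lets MySQL drive the deletion from an indexed lookup instead of testing each row against the subquery, but the syntax is not standard SQL.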

Does this mean I should optimize every statement this way?  Certainly not.  The SQL syntax I had to use is MySQL-specific, so I sacrificed portability, readability, and maintainability in order to gain acceptable performance.  I will only make that sacrifice when necessary.

I am very happy that I did not try to optimize this code until I was in a position where I could properly measure the performance with millions of real objects.  If I had optimized prematurely, I would have focused on micro-optimizations that might have only doubled or tripled the performance.  By waiting until the time was ripe, I multiplied the pack performance by a factor of at least 5,000, with only a few days of work and only minor code changes.  Next time someone asks why premature optimization is so evil, I will try to remember this experience.  Most of the best software developers I know skip micro-optimizations and focus on readability instead, helping to ensure that much larger optimizations will be possible later.

Note that during my optimization work, I never once needed to interrupt the main site or restart MySQL.  I only restarted my own Python program repeatedly.  I doubt anyone else knew there was a pack going on.