Kicking off some posts about various performance challenges we've fixed.

N Factorial

[Edit: Thomas point out this is not really factorial, because it's addition of each number in the sequence intead of multiplication].

During a code review for a site we were taking over, I found this little gem:

<?php

function charity_view_views_pre_render($view) {
  // this code takes the rows returned from a view query after the query has been run, and formats it for display...
  // snip to the code of interest:
  usort($view->result, 'charity_view_sort_popular');
}

// sorting callback function, called by usort using a bubble sort algorithm
function charity_view_sort_popular($row1, $row2) {
  $a = charity_view_popularity($row1->nid);
  $b = charity_view_popularity($row2->nid);
  if ($a === $b) {
    return 0;
  }
  return ($a < $b) ? 1 : -1;
}

// look up popularity of a charity
function charity_view_popularity($entity_id){
  $result = db_query("SELECT count(distinct etid) as count FROM {og_membership} WHERE gid=:gid", array(':gid' => $entity_id));
  $count = 0;
  foreach ($result as $row) {
    $count = $row->count;
  }
  return $count;
}

 

... For those who don't code, there's a couple things to understand. First of all, this is using a bubble sort[edit: Thomas points out that PHP uses a faster algorithm called Quicksort, but the performance implications do still apply]. A bubble sort is done by "bubbling" items until they are sorted -- imagine taking 100 chips marked with numbers from 1 to 100. Mix them all up, and then put them in a random order. Now how do you put them all in order? With a bubble sort, you take the first two chips in the row, look at them, and then put the one with the smaller number to the left. Then pick up the second and third chips, and put them in order. Then the third and fourth. Repeat until you reach the end. Chip number 100 will end up as the last chip on this pass. Then start again at the beginning, and do the same thing -- chip 99 will end up right next to 100 and you can stop. You potentially have to do 12254950 individual comparisons before all the chips are in the right order.

Computers can do this very, very, very fast, so that's not the problem. The problem is what it has to do inside that comparison -- query the database not once but twice.

The real problem here is that the code has to wait for the response from each database query to do the comparison, before going to the next (of 12254950) iterations of the sort.

And, worse, adding a single item to the database adds potentially another 100 database calls. So the more data there is in this system, the dramatically worse the performance will get. Not exponentially, but by an ever increasing sequence -- 1+2+3+4+5+6+7...+n

How many performance killers are in just this single example? Too many to go into, but I'll highlight what I consider the most egregious.

Performance killer: database calls inside a sort

Never, never do this.

The fix? Sort the results entirely in the database. Databases are extremely good at sorting -- if you are sorting any kind of result in the code layer based on data in the database, you're doing it wrong.

The funny thing about this chunk of code is that the developer knew enough about the internal workings of Drupal views to hook into functions to alter the results. Instead of doing that, the developer should have altered the query that views was generating to put all the sorting in the database where it belongs -- if they couldn't get views to generate the query they wanted by adding appropriate relationships and fields to the "sort" condition.

Views, since Views 3.0, has built-in support for aggregation. So getting a count is really straightforward. At a minimum, add the field you want to sort on to the result set, so you can avoid any further database calls.

Performance killer: Using PHP to sort all rows

If you can add the field to views, you can sort on it. And if you can do that, you can use paging -- limit the result set to 10, 50, 100, or however many rows you want. Instead of the entire data set.

The second big, big problem with this code is that for this project, the data set would reasonably be expected to reach 10s of thousands of rows. If you're sorting in PHP instead of the database, you need to examine every single row to do the sorting -- which means you need to load every single row to find out how to sort it.

This completely exacerbates the performance problem. Which is why it's a top contender for the Drupal Coding Hall of Shame, in my book!

Solution: Put the sort in the database

To fix this, we didn't even need to use code. We just made the view an aggregate view, added a relationship to the og membership, and sorted on a count of that field. Done.

The bizarre thing about this is that of all the ways you might need to extend or modify views, doing a sort after the query runs strikes me as the worst possible option.

Better would be to alter the query to add a sort clause on the joined table (if what you wanted to do wasn't available inside views).

If you couldn't do it inside views, though, much better than adding a query alteration would be creating a views handler to make what you need available to views. This is sometimes (not always) easier than doing a query alter, but it's much better because you can then re-use that functionality in any other view you might want it, and your original view stays maintainable -- query alters under the hood create a kind of "magical" unexpected behavior on this view that can lead to maintenance headaches down the road.

Industry

Thanks for sharing this! It's really an particularly aweful example of coding without any thought to performance.

However, you have some errors in your post:

  • For exactly the reasons outlined in your post, no-one in their right mind would use bubble sort in a library, and certainly not PHP (neither for usort() nor elsewhere). See the sort() documentation:

    Note: Like most PHP sorting functions, sort() uses an implementation of » Quicksort. The pivot is chosen in the middle of the partition resulting in an optimal time for already sorted arrays.

  • Even if they did, I don't really see how you arrive at your numbers. The maximum number of comparisons, as you get almost right, is 1 + 2 + 3 + … + n - 1 (99 comparisons for the first pass, and one less for each consecutive one). However, that's (with n = 100) 4950, not 1225. And, most importantly (because you even have it in the title): none of those is nowhere near n! = 1 * 2 * 3 * … * n, which would be 93326215443944152681699238856266700490715968264381621468592963895217599993229915608941463976156518286253697920827223758251185210916864000000000000000000000000.

I completely agree with your conclusion, though: sorting always in the database, and preferably in a Views handler.

Hi, Thomas,
Thanks for your comment, and your work on the Search API, we've been using it more and more!
Goes to show I've spent too much time away from the math books and theory ;-)
I wasn't actually aware of Quicksort, and its use in PHP, that's really interesting. Regarding the math, looks like I changed numbers mid-stream! 1225 comparisons are necessary to sort 50 items, so I must have done the math with n=50 and then wrote it up with 100. Thanks for the catch! I'll correct the article for posterity.

Add new comment

The content of this field is kept private and will not be shown publicly.

Filtered HTML

  • Web page addresses and email addresses turn into links automatically.
  • Allowed HTML tags: <a href hreflang> <em> <strong> <blockquote cite> <cite> <code> <ul type> <ol start type> <li> <dl> <dt> <dd> <h1> <h2 id> <h3 id> <h4 id> <h5 id> <p> <br> <img src alt height width>
  • Lines and paragraphs break automatically.