Wednesday, March 12, 2008

Finding the median value in SQL

Most SQL dialects don't come with a median function. But one can be cobbled together using the procedural programming language that accompanies most databases.

They work like this:
1. Sort the list of values.
2. Count the number of items in the list.
3. Select the element at position n/2.

A function like this can't be expressed purely in SQL because SQL is not procedural.

I've come up with a median function that uses pure SQL:

select age
from person
where (select count(age) from person p1 where p1.age >= person.age) -
(select count(age) from person p2 where p2.age <= person.age) between 0 and 1;

This technique uses two correlated subqueries. The outer query works through the list of values (in this case, age). As it considers each age in turn, it counts the number of ages that are greater than the target age and the number of ages that are less. The difference between the two should be 0 or 1 (depending on whether there are an odd or even number of ages to consider).

When the difference is 0 or 1, you've found the median value.

This solution is O(n2). Expect it to take a long time on large data sets.

2 comments:

Eric said...

Interesting approach. Unfortunately, it returns no result when the median is a common value. For example:

[0, 0, 0, 7, 7, 7, 7, 10]

In this case, you subtraction always yields -7, -2, or 5, none of which are between 0 and 1.

My solution to this problem is to sort the results by closeness to the middle, then use MySQL's "limit" constraint to keep only the first item. Of course this means that it is no longer pure SQL :-(

select age from person order by abs((select count(age) from person p1 where p1.age >= person.age) -
(select count(age) from person p2 where p2.age <= person.age)) limit 1;

Barry Brown said...

You've hit the nail on the head. I had originally shown this to my SQL class and it worked on the dataset we were working with. But some students found datasets where it didn't work, so we came up with the solution that you found: using abs() to calculate how close they are to the middle.

Your solution needs one minor tweak: In the first sub-query, use >= as you did, but in the other, use less-than (<) instead of <=.