In SQL, how do distinct and order by work together?
The best way to illustrate this is through an example. Let’s say that we have a table called Orders like the one below – where each row represents a separate order.
Orders | ||||||||||||||||||||||||||||||||||||||||
|
Now suppose that we want to retrieve all of the salesperson ID’s and sort them in descending order according to their highest respective Order Amount value (that would be the Amount column). This will serve as a ranking of the salespeople to see who has the most valuable orders. And, of course, we only want each salesperson ID to be displayed once in the results – we don’t really care about all of their order amounts, just their highest order amount value.
So, now you think that you can just write some SQL like this to get what you want:
SELECT DISTINCT salesperson_id FROM Orders ORDER BY Amount DESC -- in descending order, returns highest amount first...
DISTINCT and Order By in MySQL
If we run that query in MySQL you may have thought that it would return this:
salesperson_id ---- 2 8 7 1
Running the SQL above in MySQL actually returns this as the result set:
salesperson_id ---- 8 7 2 1
But, wait a minute…if you just look at the Orders table above you can see that the salesperson_id with the highest corresponding Amount is not 8, but 2 – because the salesperson_id of 2 has an order with an amount of 2400! And, 2 appears 3rd in the list. So what the heck is going on here – why is our SQL returning such strange results?
Well, let’s analyze the query a bit more to see what is actually happening. We are asking for every distinct salesperson_id in the Orders table, ordered by their corresponding order Amount. But, the problem here is that the salespeople with salesperson_id values of 2 and 7 both have multiple orders in the Orders table.
The query itself is not specific enough
So, in the query above we are asking MySQL to retrieve every distinct value of the salesperson_id and order those results by their corresponding Amount value. For example, when it comes across orders with salesperson_id’s of 2, it does not know whether we want the row where the order amount is 540 or 2400 – and it has to choose only one of those rows because we specifically asked for distinct values of the salesperson_id . This means that it just chooses one of those rows arbitrarily/randomly – since we never really told it which one. And, MySQL is obviously choosing the row where the amount is 540, because 2 should be returned at the top of our list if it chose the row where the Amount is 2400.
But, you might be thinking that we specify that we want to order the results by the descending Amount values – so why doesn’t the SQL just take the highest value for each salesperson_id and use that? Well, because we never really told SQL that is what we actually wanted! Look closely at the SQL and you will see what I mean – do we ever actually specify to choose the highest Amount for EACH salesperson_id and to use that value? No, we don’t!
And that means the problem is that the SQL is not specific enough – we have to tell the RDBMS exactly what we want in order to get the right results, otherwise you get results that do not make sense. In other words, when you do stupid things, stupid things happen.
Why does Mysql allow columns in the ORDER BY if they are not part of the select DISTINCT list?
Actually, running the query above would result in an error message in other RDBMS’s like SQL Server. The only reason MySQL allows it is because it assumes you know what you are doing – the query would actually make sense if the Amount value was the same across different rows for a given salesperson_id. As always, an example will help clarify what we mean here. Let’s suppose that the Orders table looks like this instead:
Orders | ||||||||||||||||||||||||||||||||||||||||
|
Now, if we run that same exact query:
SELECT DISTINCT salesperson_id FROM Orders ORDER BY Amount DESC -- in descending order, returns highest amount first...
We will now get results that make sense:
salesperson_id ---- 2 8 7 1
The reason we get the results that we expected is that now the rows with salesperson_id’s of 2 and 7 all have the same exact value for Amount. This means that even though MySQL will arbitrarily choose a row in the group of rows with salesperson_id of 2 or 7, each row will have the same exact Amount value as all others, so it does not matter which row MySQL chooses in a given group – you will get the same results.
So, we can say that it is safe to order by with a non-select column and select a different distinct column when the different groups of that distinct column all contain the same values for the order by column. That sounds confusing, but it should make sense if you paid attention to our example. If that condition does not hold true, then you will run the risk of getting some very unexpected results, as we had shown above as well.
Now, the question is what is a good workaround to the problem we presented above? Read on below to find out – the solution we present should work across most (if not all) RDBMS’s.
Workaround for the “ORDER BY items must appear in the select list if SELECT DISTINCT is specified” error message in SQL Server
As we mentioned above, MySQL would allow you to run a query like this without throwing any error:
SELECT DISTINCT salesperson_id FROM Orders ORDER BY Amount DESC
But, SQL Server actually does throw an error which says “ORDER BY items must appear in the select list if SELECT DISTINCT is specified”. So, the question is what is a good solution to modify our SQL so that we can workaround this error message, and get the results that we actually want?
Well, you might think that in order to fix the error message that you would get in SQL server you could just write some code like this:
SELECT DISTINCT salesperson_id, Amount FROM Orders ORDER BY Amount DESC
But, think carefully about what the SQL above is doing. It is applying the DISTINCT keyword to both the salesperson_id and Amount columns – which basically means that every row where those 2 columns have a distinct combination of values will be returned in the results. Take a look at the Orders table and you can see that every row in the table has a distinct combination of the salesperson_id and Amount values, which also means that the salesperson_id and Amount will be returned from every row in the table when the SQL above is run. Of course, the results will be ordered by the Amount in descending order.
And, this is what the results will look like when we run the SQL above:
salesperson_id Amount 2 2400 8 1800 7 720 7 600 2 540 1 460 7 150
But, is this what we actually wanted? No! What we really want is the list of salesperson ID’s in order of who has the highest valued order – where each salesperson ID only appears once in the result list . All the query above is giving us is basically every row’s salesperson_id and Amount combination in the table, ordered by the Amount value.
So what is a workaround for this problem – in other words, how can we be more specific to get what we really want? Well, let’s rephrase the problem – what if we say we want to retrieve each salesperson ID sorted by their respective highest dollar amount value (and only have each salesperson_id returned just once)? This is different than just saying that we want each distinct salesperson ID sorted by their Amount, because we are being more specific by saying that we want to sort by their respective highest dollar amount value. Hopefully you see the difference.
Now that we have a more specific question in mind, let’s see if we can come up with a more specific answerso that we can write the correct SQL. Well, since we want to find the highest Amount value for each salesperson_id, what SQL construct do you think we should use? If you guessed group by you would be correct – because in order to find the highest value for a group of salesperson_id’s, we would need to use the GROUP BY statement. Then, we can order the results by the maximum value in each group of salesperson_ids. So, this is what the SQL would look like:
SELECT distinct salesperson_id FROM Orders GROUP BY salesperson_id ORDER BY MAX(Amount) DESC -- in descending order, returns highest amount first...
Just to clarify how the group by will work – for the “group” of salesperson ID’s equal to 7, the maximum value of the amount would be 720. And for the “group” of salesperson ID’s equal to 2, the maximum value of the amount would be 2400. So, running the SQL above would give us these results, which is correct:
salesperson_id ---- 2 8 7 1
Finally we have a query that makes sense, which also gives us results that make sense!