MySQL, the limits of the offset
I was chatting with a friend the other day about exporting large quantities of data in batches. I am not sure how, but we ended up talking about the use of "LIMIT".
There are a few mistakes people make when working with SQL and adding a LIMIT with an offset to their queries. A database lets us query our data and get results rather fast, but some very simple queries can create performance issues. It's important to understand the basics of how things work to understand the issue.
In this article we will talk about the usage of "LIMIT", which can create 2 issues:
- Not returning expected results.
- Having performance issues.
Neither of these issues is a "bug"; they are behaviours you need to be aware of.
First, let's have a look at why we might not get the expected results in some cases.
Unexpected results.
In our example we will consider the following dataset of 10 rows, which contains only an ID.
10 | 13 | 15 | 17 | 21 | 22 | 24 | 25 | 27 | 28 |
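If you want to follow along, here is a minimal sketch of that dataset (my_table and its id column are the names used throughout this article; the exact column type is my assumption):
CREATE TABLE my_table (id INT PRIMARY KEY);
INSERT INTO my_table (id) VALUES (10), (13), (15), (17), (21), (22), (24), (25), (27), (28);
Keep in mind that, as discussed below, a table this small will almost never show the unordered behaviour in practice.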
Let's now execute the following query:
SELECT * FROM my_table LIMIT 5
This should return the following result:
10 | 13 | 15 | 17 | 21 |
And if we now add an offset of 5, the remaining 5 elements should be fetched.
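In MySQL that second query can be written like this (LIMIT 5, 5 is the equivalent shorthand):
SELECT * FROM my_table LIMIT 5 OFFSET 5
which gives us: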
22 | 24 | 25 | 27 | 28 |
We now have a website displaying this on 2 pages, and we are happy: all our rows are accessible to the end users.
Or are they?
In reality, if you do not order by anything, MySQL can very well return the same row twice. So the two pages might actually come back as:
10 | 13 | 15 | 17 | 21 |
22 | 15 | 25 | 27 | 28 |
So we received id "15" twice and never received id 24.
This is a documented behaviour, not a bug!
Most probably you will never run into this issue, for multiple reasons:
If you are using an ORM, it will write the following query instead:
SELECT * FROM my_table ORDER BY id LIMIT 5
Once the ORDER BY is in place, the database will behave as expected.
It's quite improbable
This behaviour is not easy to observe: if you have only a few hundred or even a few thousand entries, you might never see it. And even if it does happen on your website, it's very improbable anyone will notice it on a page.
This behaviour is relatively easy to observe when doing mass exports in batches from a database. I observed it while debugging an export that was exporting the same order multiple times and missing others, with batch queries roughly like the sketch below. The database in question had hundreds of thousands of rows, and fewer than a dozen of them were missing.
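For illustration only (the table name and batch size here are hypothetical, not the real ones from that project), the export was paging through the data with plain offsets and no ORDER BY:
SELECT * FROM orders LIMIT 1000 OFFSET 0
SELECT * FROM orders LIMIT 1000 OFFSET 1000
SELECT * FROM orders LIMIT 1000 OFFSET 2000
-- ... and so on. Without an ORDER BY, nothing guarantees two batches don't overlap or skip rows.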
The issue is that adding an ORDER BY is not the whole solution either. To understand what is actually going on, and why we don't receive the expected result without an ORDER BY, let's see how `LIMIT` and the ORDER BY work.
Performance impacts.
Yes, the ordering by id is done using the index, and is therefore extremely fast. MySQL uses a B-tree for the index, so our data looks like this:
So when we execute our query to get the first 5 results, the tree allows the 5 elements to be found immediately.
So the following query will go through the following leaves of the tree:
SELECT * FROM my_table ORDER BY id LIMIT 5
Now if we add our offset to the limit:
SELECT * FROM my_table ORDER BY id LIMIT 5 OFFSET 5
The index still helps, but we had to go through all 10 indexed items. So we had to analyze 10 ids instead of just 5. This is understandable: MySQL can't know it has to start at row id 22. It needs to walk through all the previous rows and count them before it finds the sixth one.
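To make the cost concrete, picture a (hypothetical) table with a few hundred thousand rows and a query like this:
SELECT * FROM my_table ORDER BY id LIMIT 5 OFFSET 100000
Even with the index, MySQL has to walk through and discard the first 100000 index entries before it can return the 5 rows we asked for.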
This is fine if you have 2, 5, 10 or even a hundred pages with proper indexes, because indexes are fast. But as fast as they are, at some point they will not be fast enough.
Let's optimize
Optimizing this when you are sorting on ids is actually very simple: we are going to use the knowledge that the ids are always increasing.
So our first query is unchanged:
SELECT * FROM my_table ORDER BY id LIMIT 5
But in the second query we will do the following:
SELECT * FROM my_table WHERE id > 21 ORDER BY id LIMIT 5
21 is the id of the last row we fetched with the first query. This way, the index can be used to jump straight to the right place.
Here is a representation of the B-Tree with the elements that were "used".
As you can see, almost none of the elements from the first page are touched in the B-tree. The gain is noticeable on big databases.
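Put together, paging through our 10-row example with this technique looks like this, each query seeded with the last id returned by the previous one:
-- page 1
SELECT * FROM my_table ORDER BY id LIMIT 5
-- page 2 (21 was the last id of page 1)
SELECT * FROM my_table WHERE id > 21 ORDER BY id LIMIT 5
-- page 3 (28 was the last id of page 2) returns nothing, which tells us we are done
SELECT * FROM my_table WHERE id > 28 ORDER BY id LIMIT 5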
Let's wrap it up
Basically, when you are not using any sorting, MySQL uses another algorithm to select the rows. That method is much faster, but comes with the drawback of not returning the rows in a guaranteed, repeatable order.
MySQL is not the only one with this behaviour; most databases and even search engines have it. For big datasets it's therefore recommended to add conditions to optimize your queries.
This is even more true if you are using distributed systems such as Elasticsearch. When you don't give any conditions, each node returns its own first X elements and the merged result has to be processed on a single node. Even if you set a limit of 5, with an offset of 1000 each shard still has to return its first 1000+ documents, and the coordinating node has to go through all of them to find the 5 you asked for. With a condition, each shard returns 5 documents in the worst case.
So next time you are working with pagination, think twice.
Note:
You might be wondering: "But I would like to sort by update date, how can I do that? There might very well be 2 entries updated at the same time."
Well, it's not that complicated: you will need to sort not just on the date, but on the date and the id:
SELECT *
FROM my_table
WHERE (update_date = '2017-12-21' AND id > 21) OR update_date > '2017-12-21'
ORDER BY update_date, id LIMIT 5
The WHERE here is slightly more complicated: if the update_date is equal to the update date of the last row we read, then we want rows whose id is greater than the id of that last row. Otherwise, we simply take rows whose update date is greater than the last update date we found.
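As a side note, the same condition can also be written with a row constructor, which is valid SQL and expresses the same lexicographic comparison; whether MySQL uses the index as efficiently for it depends on the version, so treat this variant as a sketch rather than a guaranteed optimization:
SELECT *
FROM my_table
WHERE (update_date, id) > ('2017-12-21', 21)
ORDER BY update_date, id LIMIT 5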