38

I have a SELECT that returns more than 70 million rows.

I'd like to save the selected data into one large CSV file on Windows Server 2012 R2.

Q: How can I retrieve the data from MySQL in chunks for better performance? When I try to save the result of the single large SELECT, I get out-of-memory errors.

Toren

4 Answers

56

You could try using the LIMIT feature. If you do this:

SELECT * FROM MyTable ORDER BY whatever LIMIT 0,1000

You'll get the first 1,000 rows. The first LIMIT value (0) defines the starting row in the result set. It's zero-indexed, so 0 means "the first row". The second LIMIT value is the maximum number of rows to retrieve. To get the next few sets of 1,000, do this:

SELECT * FROM MyTable ORDER BY whatever LIMIT 1000,1000 -- rows 1,001 - 2,000
SELECT * FROM MyTable ORDER BY whatever LIMIT 2000,1000 -- rows 2,001 - 3,000

And so on. When the SELECT returns no rows, you're done.

This isn't enough on its own though, because any changes done to the table while you're processing your 1K rows at a time will throw off the order. To freeze the results in time, start by querying the results into a temporary table:

CREATE TEMPORARY TABLE MyChunkedResult AS (
  SELECT *
  FROM MyTable
  ORDER BY whatever
);

Side note: it's a good idea to make sure the temporary table doesn't exist beforehand:

DROP TEMPORARY TABLE IF EXISTS MyChunkedResult;

At any rate, once the temporary table is in place, pull the row chunks from there:

SELECT * FROM MyChunkedResult LIMIT 0, 1000;
SELECT * FROM MyChunkedResult LIMIT 1000,1000;
SELECT * FROM MyChunkedResult LIMIT 2000,1000;
... and so on.

I'll leave it to you to create the logic that will calculate the limit value after each chunk and check for the end of results. I'd also recommend much larger chunks than 1,000 records; it's just a number I picked out of the air.
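For illustration, here's a minimal sketch of that loop in Python, assuming the PyMySQL driver and the standard csv module; the connection settings, file path, and chunk size are placeholders, and `whatever` is the ORDER BY column from above:

import csv
import pymysql

CHUNK_SIZE = 100_000  # much larger than 1,000; tune it to your memory budget

# A TEMPORARY table is only visible to the connection that created it,
# so the CREATE TEMPORARY TABLE must run on this same connection.
conn = pymysql.connect(host="localhost", user="user",
                       password="secret", database="mydb")
cur = conn.cursor()
cur.execute("""
    CREATE TEMPORARY TABLE MyChunkedResult AS (
        SELECT * FROM MyTable ORDER BY whatever
    )
""")

with open(r"C:\export\MyTable.csv", "w", newline="") as f:
    writer = csv.writer(f)
    offset = 0
    while True:
        cur.execute("SELECT * FROM MyChunkedResult LIMIT %s, %s",
                    (offset, CHUNK_SIZE))
        rows = cur.fetchall()
        if not rows:
            break              # empty chunk: end of results
        writer.writerows(rows)
        offset += CHUNK_SIZE

conn.close()  # closing the connection also discards the session's temporary tables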

Finally, it's good form to drop the temporary table when you're done:

DROP TEMPORARY TABLE MyChunkedResult;
Ed Gibbs
  • How do you make this into a loop? – Egor Okhterov May 08 '18 at 13:26
  • @Pixar It basically depends on which technology you're planning to use. You can do it directly in MySQL with a `CREATE PROCEDURE`, or another good way is to do it in PHP or Python with a while loop and a specific chunk size for the data you want to select. My suggestion is to use `pipe` and `stream` with **node**, which I think is the best approach. – Giulio Bambini Jun 07 '18 at 17:21
  • This method is unusable for large tables; only tiny tables work well like that. – John Jul 24 '18 at 01:41
  • I searched a lot for this. I think this is the best simple answer. – shgnInc Aug 15 '18 at 04:42
  • @John I think you'd have to narrow down the SELECT statement in that case. `SELECT * FROM table where item="something" LIMIT 0,1000;` – Aaron McKeehan Jan 04 '19 at 20:48
  • @Aaron Using OFFSET is also not usable on large tables. I've invested months in such tasks, and the only solution is to have a numeric primary key you can step through manually: `WHERE id BETWEEN a AND a+10000`, then `a += 10000`. The problem is that MySQL won't let you do such tasks properly: it does not remember any internal pointer, so you cannot continue where you stopped. – John Jan 05 '19 at 01:19
  • @John is right: OFFSET still scans all the rows up to the specified value. The better approach is to use a last-id reference together with LIMIT. – Gokigooooks Feb 28 '20 at 20:00
  • I agree that OFFSET is less than ideal, and that tuning the table with things like numeric PKs will be faster. Omitting OFFSET and splitting by date range or some other high-cardinality value would also work, but unfortunately the OP didn't provide information to allow that kind of answer. This solution is generic, for those with large (but not huge) tables, or for those with no other option but to ignore speed so the proc will at least run without blowing up. – Ed Gibbs Feb 29 '20 at 03:43
  • @EdGibbs what if more than a terabyte of data accumulates in my DB each day and I have to fetch data for at least 10 days? – Aadhi Verma May 08 '23 at 02:57
  • @AadhiVerma in that case don't try this answer. It's based on 70M rows, which isn't that much, so there's no optimization. I'd recommend a stepped approach like prafi's answer below: pull the first 10,000 rows ordering by timestamp *and* a tiebreaker, use values from the 10,000th row to feed the `WHERE` clause for the next query, repeat until done. Be *very* careful of ties, where the 10,000th record has a timestamp the same as the 10,001st, in which case prafi's answer won't include the 10,001st -- that's why you'll need a tiebreaker. Hope this helps; it's all I can fit into a comment. – Ed Gibbs May 08 '23 at 19:36
17

The LIMIT/OFFSET approach slows the query down as the size of the data grows. Another approach is to use something called keyset pagination. It requires a unique id in your query, which you can use as a bookmark to point to the last row of the previous page. The next page is fetched using that bookmark. For instance:

SELECT user_id, name, date_created
FROM users
WHERE user_id > 0
ORDER BY user_id ASC
LIMIT 10000;

If the resultset above returns the last row with user_id as 12345, you can use it to fetch the next page as follows:

SELECT user_id, name, date_created
FROM users
WHERE user_id > 12345
ORDER BY user_id ASC
LIMIT 10000;

For more details, you may take a look at this page.
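For illustration, a minimal sketch of that keyset loop in Python, assuming the PyMySQL driver and the standard csv module; connection settings, file path, and page size are placeholders:

import csv
import pymysql

PAGE_SIZE = 10000

conn = pymysql.connect(host="localhost", user="user",
                       password="secret", database="mydb")
cur = conn.cursor()

last_id = 0  # bookmark: the user_id of the last row already written

with open(r"C:\export\users.csv", "w", newline="") as f:
    writer = csv.writer(f)
    while True:
        cur.execute(
            "SELECT user_id, name, date_created FROM users "
            "WHERE user_id > %s ORDER BY user_id ASC LIMIT %s",
            (last_id, PAGE_SIZE),
        )
        rows = cur.fetchall()
        if not rows:
            break                   # no more pages
        writer.writerows(rows)
        last_id = rows[-1][0]       # advance the bookmark to the last user_id

conn.close()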

prafi
0

Another approach for such a large dataset, to avoid the need to chunk the output, would be to query the relevant data into its own new table (not a temporary table) containing just the data you need, and then use mysqldump to handle the export to file.
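A rough sketch of how that could look, driven from Python; the staging table, credentials, and output directory are hypothetical, and mysqldump's --tab option makes the MySQL server itself write the data file, so the directory must be writable by the server (see secure_file_priv):

import subprocess
import pymysql

OUT_DIR = r"C:\dump"  # hypothetical; must be writable by the MySQL server

# 1) Materialise just the rows you need into a regular (non-temporary) table.
conn = pymysql.connect(host="localhost", user="user",
                       password="secret", database="mydb")
with conn.cursor() as cur:
    cur.execute("CREATE TABLE export_data AS SELECT * FROM MyTable")
conn.close()

# 2) Export it with mysqldump: --tab writes export_data.sql (schema) and
#    export_data.txt (data); the --fields options make the data file comma-separated.
subprocess.run(
    ["mysqldump", "-u", "user", "-psecret",
     "--tab=" + OUT_DIR,
     "--fields-terminated-by=,",
     '--fields-enclosed-by="',
     "mydb", "export_data"],
    check=True,
)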

Peter
0

Use an unbuffered result set (MYSQLI_USE_RESULT) so you can read through the result and perform an action, such as writing output to a CSV file, row by row.

In short: it writes to the CSV file while it reads from the database.

By default, mysqli_query uses MYSQLI_STORE_RESULT, which reads the whole result set into memory at once and causes the excess memory usage.

Read this for more info on MYSQLI_USE_RESULT, and be careful: you may not be able to perform other tasks/queries on the same connection while the unbuffered result is being read.
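For reference, the equivalent idea in Python is an unbuffered (server-side) cursor; with the PyMySQL driver that's SSCursor, which streams rows one at a time instead of buffering the whole result set (connection settings and names are placeholders):

import csv
import pymysql
import pymysql.cursors

conn = pymysql.connect(host="localhost", user="user",
                       password="secret", database="mydb",
                       cursorclass=pymysql.cursors.SSCursor)  # unbuffered cursor

with open(r"C:\export\mytable.csv", "w", newline="") as f:
    writer = csv.writer(f)
    with conn.cursor() as cur:
        cur.execute("SELECT * FROM MyTable")
        for row in cur:   # rows are streamed instead of being loaded all at once
            writer.writerow(row)

conn.close()

As with MYSQLI_USE_RESULT, avoid issuing other queries on the same connection until the streamed result has been read to the end.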

Meesam