
I need to export one million records from MySQL to Elasticsearch. I used the following SQL queries:

select * from tlogs limit 1,100000
select * from tlogs limit 100001,100000
select * from tlogs limit 200001,100000
select * from tlogs limit 300001,100000
....

Are the above queries OK? I'm worried that the results come back in a random order, which would cause duplicate records to be imported. I know it would be fine if I sorted before the LIMIT, but the sorting is very slow.

Jack
  • You have no primary key for that table? – Alex Nov 03 '15 at 19:47
  • Without `ORDER BY` your result set is not guaranteed. – Lukasz Szozda Nov 03 '15 at 19:47
  • No, they are not OK. You need an `ORDER BY` to guarantee the ordering of the results. I would suggest ordering by a primary key, so no work actually needs to be done. – Gordon Linoff Nov 03 '15 at 19:48
  • Data retrieved using SQL is by definition not ordered. If you see any order now, you have to regard that as pure coincidence, and it could change at any time due to code or data changes. – Norbert Nov 03 '15 at 19:48
  • Do you want just any 1,000,000 rows? Or do you want the newest 1,000,000 log entries by date? – Schwern Nov 03 '15 at 20:14

1 Answer


There is no guaranteed sorting order unless you specify one. Even if the results seem to come back in some order, there's no guarantee that won't change in the next version of MySQL, or the next time you add a row, delete a row, change a row, change an index, or MySQL decides to repack your data.

You must specify a sort order. You must also do your work in a transaction, otherwise some other process could add or delete rows while you're running your queries. I chose the primary key because it's likely to already be in order and already indexed.

BEGIN;
SELECT * FROM tlogs ORDER BY id LIMIT 1,100000;
...
COMMIT;

I would point out that getting your rows a few at a time is a waste. The ORDER BY and LIMITs will be expensive. Instead, you should just query all 1,000,000 rows and fetch them one at a time. In Perl this would be...

# $dbh is an already-connected DBI database handle.
my $sth = $dbh->prepare("SELECT * FROM tlogs LIMIT 1000000");
$sth->execute;

# Loop over the result one row at a time.
while( my $row = $sth->fetch ) {
    ... do something with the $row ...
}

This avoids the need for a transaction, ordering, and limits. The query should execute almost immediately. The major cost now is your database API's overhead for fetching each row.
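One caveat, depending on your driver: with DBD::mysql (assuming that's what you're connecting with) the client buffers the whole result set in memory by default, and the mysql_use_result attribute switches it to streaming rows from the server. A rough sketch, where the connection details are placeholders:

use DBI;

# Placeholder DSN and credentials; adjust for your server.
my $dbh = DBI->connect("dbi:mysql:database=mydb;host=localhost",
                       "user", "password", { RaiseError => 1 });

# mysql_use_result asks DBD::mysql to stream rows from the server
# instead of buffering the entire result set on the client.
my $sth = $dbh->prepare("SELECT * FROM tlogs LIMIT 1000000",
                        { mysql_use_result => 1 });
$sth->execute;

while( my $row = $sth->fetchrow_hashref ) {
    # index each row into Elasticsearch here
}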


All this just gets you the first 1,000,000 rows MySQL decides to give you. That doesn't seem very useful. It will probably be the last 1,000,000 rows inserted, but that's not guaranteed. There's also no guarantee that the last 1,000,000 rows inserted is the newest 1,000,000 log entries.

If you want the newest 1,000,000 log entries you need an ORDER BY and I hope that field is indexed.

SELECT * FROM tlogs ORDER BY date DESC LIMIT 1000000;

Finally, you should see if SELECT ... INTO OUTFILE can be used to export your data. It will likely be the most efficient option.
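For example, a minimal sketch (the output path and delimiters are assumptions; the file is written on the database server's filesystem, and the MySQL account needs the FILE privilege):

SELECT *
INTO OUTFILE '/tmp/tlogs.csv'
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\n'
FROM tlogs;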

Schwern
  • Thanks! The example is a small table; actually I have a couple of tables with 1.5 billion records, so I have to split the queries up with LIMIT. Some tables have a primary key, some don't. – Jack Nov 03 '15 at 20:47
  • @Jack Why do you feel you need to separate the queries by LIMIT? You should only be using LIMIT if you want a limited number of rows. If you work row-by-row your SQL client doesn't fetch all 1.5 billion rows at once, just one at a time (or maybe a few for buffering). As for the lack of primary keys, add them now with an ALTER TABLE command. – Schwern Nov 03 '15 at 20:58
  • @Jack Selecting slices with multiple limited queries will be very inefficient and get worse as the table gets larger. For queries with `ORDER BY blah LIMIT X,Y` MySQL has to sort the whole table and read through X rows before outputting Y. It has to redo this for each of your queries. As X becomes bigger the query will become slower. You're reading and sorting the table multiple times. See [this answer](http://stackoverflow.com/a/4502426/14660). If you do it in one query without the LIMIT MySQL only has to sort and read the table once. If you fetch row-by-row you won't use a ton of memory. – Schwern Nov 03 '15 at 21:32
  • Good question! I separate the queries because I need to parallelize the MySQL->ES migration. If I can split the queries, I can use multiple servers to do the migration. For example, for a table with 1.5 billion records, server one can transform the first 0.5 billion while, in the meantime, the second and third servers transform the second and third 0.5 billion; i.e. the MySQL server gets three requests from three servers at the same time, which would be faster. But for the tables that have a primary key, I will use "where id between 1 and 100M" rather than LIMIT. Thanks! – Jack Nov 03 '15 at 21:42
  • @Jack Doing it in parallel makes it difficult to use a transaction. This runs the risk of changes sneaking in between the queries. Using `WHERE id BETWEEN X AND Y` solves that problem, assuming `id` is incremented (see the sketch after this thread). It also avoids the cost of an `ORDER BY`, though I don't know if `BETWEEN` will be any faster. What might be simpler and faster is `SELECT INTO OUTFILE` and then have multiple processes work on slices of that file. Of course, in the time you took to work this parallel plan out you probably could have finished the single-threaded run. :) – Schwern Nov 03 '15 at 22:16
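
For reference, a minimal sketch of the id-range slicing discussed in the comments above. The table name comes from the question; the three split points are assumptions for a 1.5 billion row table divided evenly across three workers, with each query run by a different migration server:

-- Worker 1
SELECT * FROM tlogs WHERE id BETWEEN 1 AND 500000000;
-- Worker 2
SELECT * FROM tlogs WHERE id BETWEEN 500000001 AND 1000000000;
-- Worker 3
SELECT * FROM tlogs WHERE id BETWEEN 1000000001 AND 1500000000;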