I am trying to load a small sample of records from a large database into a test database.

How do you tell mysqldump to only give you n records out of 8 million?

Thanks

4 Answers


As skaffman says, use the --where option:

mysqldump --opt --where="1 limit 1000000" database

Of course, that would give you the first million rows from every table.
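Since the goal is to populate a test database, the dump can be piped straight into it. A minimal sketch, assuming hypothetical database names `proddb` and `testdb`; by default it only prints the command instead of running it, since executing it needs a live server:

```shell
# Copy the first N rows of every table from a source database into a
# test database. "proddb" and "testdb" are placeholder names.
# DRY_RUN=1 (the default) prints the command; DRY_RUN=0 executes it.
sample_db() {
  src=$1; dst=$2; n=$3
  cmd="mysqldump --where=\"1 LIMIT $n\" $src | mysql $dst"
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$cmd"
  else
    eval "$cmd"
  fi
}

sample_db proddb testdb 1000
```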

Adam Bellaire
    What does the "1" before limit do? – Phob Jul 14 '11 at 23:29
    @Phob: The --where option is basically appended to a query of the form `SELECT * from table WHERE `, so in this case you get `SELECT * from table WHERE 1 limit 1000000`. Without the 1, you would have an invalid query. Specifying 1 for a where clause (since 1 is always true) simply selects all records. – Adam Bellaire Jul 15 '11 at 01:28
    Wow, what a hack. So you can basically SQL inject yourself this way. – Phob Jul 15 '11 at 16:49
    Does this maintain all foreign key integrities? If not, is there a way to do that? – keithxm23 Oct 11 '12 at 20:51
  • Is there a way to get the *last* 1,000,000 rows, i.e. the most recently added ones? – Mohamed Hafez Aug 31 '16 at 19:20
    Thanks! Additionally, you can use: `mysqldump --opt --where="1 limit 1000000 offset 1000000" --no-create-info database` to get the second page of 1 million records. Make sure to use the _--no-create-info_ flag on pages other than the first to only dump the data and leave off the _create table_ stuff. – pfuri Apr 10 '17 at 19:56
    Looks like `--opt` isn't necessary. From the manpages: "Because the --opt option is enabled by default, you only specify its converse, the --skip-opt to turn off several default settings." – apostl3pol Dec 08 '17 at 01:11
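The paging trick from the comments above can be scripted. A sketch that only prints the commands it would run; `mydb` and the page size are placeholders:

```shell
# Print one mysqldump command per 1,000,000-row page. Page 0 keeps the
# CREATE TABLE statements; later pages pass --no-create-info so they
# contain data only, as noted in the comments above.
page_size=1000000
for page in 0 1 2; do
  offset=$((page * page_size))
  flags=""
  [ "$page" -gt 0 ] && flags="--no-create-info"
  echo mysqldump $flags --where=\"1 LIMIT $page_size OFFSET $offset\" mydb
done
```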

If you want to get n records from a specific table you can do something like this:

mysqldump --opt --where="1 limit 1000000" database table > dump.sql

This will dump the first 1000000 rows from the table named table into the file dump.sql.

Casper André Casse

The default order is ASC, which is rarely what you want in this situation. To make DESC work out of the box you need a consistent database design: if all your tables have ONE primary-key column with the same name (natural or surrogate), you can easily dump the n latest records using:

mysqldump --opt --where="1 ORDER BY id DESC limit 1000000" --all-databases > dump.sql

This is a perfect example of why you should always name your PKs id and avoid composite PKs, even in association tables (use surrogate keys instead).

Andreas Bergström
  • Do this (name your PKs id and avoid composite PKs) and you'll have to ignore relational database theory. – mold Nov 22 '17 at 17:58
  • Actually, if you design your database following relational best practices, defining your PKs based on the data and the entity, you can use --opt --where="1 LIMIT 10000", for example. Without ORDER BY this works because MySQL will return rows in their natural order, which is equivalent to saying it follows the PK index order. Then all the FKs of related tables will reference only data that exists in their parent table, because the order will be the same. – mold Nov 22 '17 at 18:22
  • The use of IDs is a true plague among developers. Having IDs as PKs is the same as having no PKs at all: your integrity goes out the window because, in most cases, an auto-increment number has nothing to do with the entity data. – mold Nov 22 '17 at 18:22
  • @mpoletto --where="1 LIMIT 10000" will only pick the first 10000 entries. The whole point of my answer was to show how to get the latest X entries, which is usually what you want. I also do not understand what naming conventions have to do with "ignoring relational database theory"; I think you misunderstood my answer. Most popular ORMs (EF, the Django ORM, etc.) default to and advise id for PK columns, since it is redundant to say users.user_id instead of just users.id. – Andreas Bergström Nov 23 '17 at 08:50
  • When you say that there is a "perfect reason why you should always name your PKs id and avoid composite PKs", you are ignoring relational database theory. Your argument about "most popular ORMs" isn't valid, because those ORMs need tables with IDs in order to work. – mold Nov 23 '17 at 16:40
  • @mpoletto And how am I ignoring RDBMS theory by saying that PKs should be called simply id instead of, e.g., user_id? – Andreas Bergström Nov 23 '17 at 17:00
  • When you say to avoid composite keys. You don't avoid keys when you design a relational model; you define keys because you need them, no matter whether they are composite or not. You don't design a model based on what an ORM needs, and not all models fit surrogate keys. Unfortunately, this is a very common practice among programmers. – mold Nov 23 '17 at 18:40
  • Thank you very much, this is really what I was searching for. – MAx Shvedov Apr 03 '20 at 19:36
    You can also `ORDER BY 1 DESC` if your ID columns aren't named consistently but are the first logical column defined in the table. – U47 Feb 27 '21 at 06:10
  • is there a way through mysql internals to get the most recent N updated rows? – awm Jan 25 '22 at 15:14
  • @awm set an updated_at column and sort on it instead – Andreas Bergström Jan 25 '22 at 17:09

mysqldump can be given a SQL query to execute, from which it will take the data for the dump. You can then use the "limit X" clause in your query to restrict the number of rows.

skaffman
  • Really? How can you give mysqldump a query? I can't find it in the documentation. – Robert Mikes Dec 30 '20 at 21:36
  • @RobertMikes Example: `mysqldump --skip-lock-tables test parent --where="exists(select * from child where parent_id=parent.id)"` – Dexter Aug 31 '22 at 07:13
  • @Dexter OK, agreed, you can limit the number of records when querying a particular table. Which might be enough to cover the use case that the OP asked about. But you can't give a full query to mysqldump, where you would control the fields that are selected from multiple tables... Well, as a workaround you could always create a table and fill it with derived data using INSERT INTO SELECT, then dump that table... – Robert Mikes Sep 22 '22 at 11:46
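The staging-table workaround from the last comment can be sketched as follows. Table and column names (`orders`, `customers`, `order_sample`) are hypothetical; the script only prints the statements you would feed to `mysql` and `mysqldump` against a live server:

```shell
# Materialize an arbitrary multi-table query into a staging table,
# dump that table, then drop it. Printed only, not executed.
staging=order_sample
cat <<SQL
CREATE TABLE $staging AS
  SELECT o.id, o.total, c.email
  FROM orders o JOIN customers c ON c.id = o.customer_id
  LIMIT 1000;
SQL
echo "mysqldump mydb $staging > $staging.sql"
echo "mysql mydb -e 'DROP TABLE $staging'"
```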