
I have a database table with 10,000,000+ rows which I process to create a new table. What is the best type of storage engine for reading?

Currently I use:

$sql = "UPDATE `packages_unsorted` SET `acv_return_viadetails` = '0';";

Then to select the records:

$sql = "SELECT * FROM `packages_unsorted` WHERE `acv_return_viadetails` = '0' LIMIT 1000;";

I collect the primary key id of each row into an array called $ids, process the records, and write the data to a CSV file for import at the end of each 1000 rows. I then update the acv_return_viadetails field with:

$sql = "UPDATE `packages_unsorted` SET `acv_return_viadetails` = '1' WHERE `id` IN ('".implode("','",$ids)."') LIMIT 1000;";

That UPDATE is the only write to the table, and I select all fields every time. Should I be using a MEMORY table or some other type of storage engine to make things faster? Or are there any other optimizations I could introduce that would speed things up?

Also, what would be a reasonable my.cnf configuration for such an operation?

Thanks in advance :)

Andy Gee
    If you're looking for advanced techniques in tuning your MySQL setup I would highly recommend http://www.mysqlperformanceblog.com/ to determine all the tweaks you can do. Can't see why you can't just use `InnoDb` and make sure `acv_return_viadetails` has an `INDEX`. Maybe if we knew a little bit more about your dataset it might make sense to break it up into smaller tables or to reorganize the data. – Yzmir Ramirez Nov 13 '11 at 08:16
  • Thanks, I'll try Innodb but I can't really split the data up because it's a direct CSV import using LOAD DATA INFILE - approx 90 fields * 10M rows. – Andy Gee Nov 13 '11 at 08:43
  • innodb - clustered primary keys see here http://stackoverflow.com/questions/4419499/mysql-nosql-help-me-to-choose-the-right-one-on-a/4421601#4421601 – Jon Black Nov 13 '11 at 11:02
  • @Andy ofc you can split the data up. You load the file into a staging table then run some sprocs which populate your normalised tables. – Jon Black Nov 13 '11 at 11:05
  • @f00 Thanks, that's essentially what I'm doing now. I import the CSV as is into the staging table packages_unsorted. I then load chunks of data (1000 records at a time to let the mysql server free for other queries). I build a CSV file for import at the end of the procedure. It's taking around 6 hours to complete and I'm wondering if another storage engine would be more appropriate. There is only one index, other than the primary key and I use that for flagging the records I've normalised so I don't think clustered indexes will help me either. I guess I'll have to accept the time it takes. :) – Andy Gee Nov 13 '11 at 11:23
  • Why are you creating a CSV file at the end of the procedure? I think it will be faster if you just added your normalized records to a table instead of creating the CSV and then importing it. Also, I think you can avoid loading the original CSV to the database - just parse it in PHP itself and insert the processed data into a table. Take a look at http://php.net/manual/en/function.fgetcsv.php – nikhil500 Nov 15 '11 at 11:23
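
For what it's worth, a minimal sketch of that last suggestion: parse the source CSV with fgetcsv() and insert the normalised rows directly, skipping the staging table and the second CSV. PDO, the table and column names, and normalise_row() are all placeholders here, not the actual code.

$pdo = new PDO('mysql:host=localhost;dbname=packages_db', 'user', 'pass');   // placeholder DSN
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

$insert = $pdo->prepare('INSERT INTO `packages_normalised` (`col_a`, `col_b`, `col_c`) VALUES (?, ?, ?)');   // placeholder table/columns

$fh = fopen('/path/to/source.csv', 'r');        // the original ~90-field CSV
$pdo->beginTransaction();
$n = 0;

while (($fields = fgetcsv($fh)) !== false) {
    $insert->execute(normalise_row($fields));   // stand-in for the real per-row normalisation

    if (++$n % 1000 === 0) {                    // commit in chunks to keep transactions small
        $pdo->commit();
        $pdo->beginTransaction();
    }
}

$pdo->commit();
fclose($fh);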

2 Answers


Do you have an index on the acv_return_viadetails field?
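
If not, adding one should speed up the repeated lookups on that column considerably; something like the following (the index name is just an example):

$sql = "ALTER TABLE `packages_unsorted` ADD INDEX `idx_acv_return_viadetails` (`acv_return_viadetails`);";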

If you're looking at speeding up the import, I just made a program in C++ that imports a 300MB file into MySQL in 65s. Perhaps you can adapt it for your use.

See https://github.com/homer6/import-geoip-data

Homer6
  • +1 for sharing. Can you tell me what < and > mean in C++: std::auto_ptr< sql::Connection > temp_connection? – Micromega Nov 13 '11 at 08:28
  • That's a template. It allows you to create many different types of classes. For example, vector<int> creates a vector (which is a collection) of ints, whereas vector<string> creates a vector of strings. In the case you mentioned, it creates an auto_ptr to an sql::Connection – Homer6 Nov 13 '11 at 08:30
  • Yes I have an index on `acv_return_viadetails` but I only have access to PHP on the server so I can't use the C++ app. I will use it for other things though :) – Andy Gee Nov 13 '11 at 08:40

A few suggestions:

1) Replace

$sql = "SELECT * FROM `packages_unsorted` WHERE `acv_return_viadetails` = '0' LIMIT 1000;";

with

$sql = "SELECT `id` FROM `packages_unsorted` WHERE `acv_return_viadetails` = '0' LIMIT 1000;";

2) Check if it is possible to load all the ids at once, without the 1000 limit. With 10M records, it should take only a few hundred MB to load all the ids (a combined sketch of this and suggestion 3 follows at the end of this answer).

$sql = "SELECT `id` FROM `packages_unsorted`";

If not, consider increasing the limit.

3) You can try something like the following instead of using acv_return_viadetails:

$sql = "SELECT `id` FROM `packages_unsorted` LIMIT 0, 1000;";
$sql = "SELECT `id` FROM `packages_unsorted` LIMIT 1000, 1000;";
$sql = "SELECT `id` FROM `packages_unsorted` LIMIT 2000, 1000;";
nikhil500
  • Thanks, I understand, but I need all of the fields from the table to process, so I have to select * each time. I also need to chunk it (by 1000) so it releases the mysql server for queries on other tables, so I can't grab all ids either :( – Andy Gee Nov 13 '11 at 09:14
  • Maybe you could just select the `id`, store them in a PHP list and then process them one-by-one. I believe grabbing all ids should not take too many resources. – nikhil500 Nov 15 '11 at 11:18