3

How can I persist a large amount of data read from a CSV file (say, 20 million rows)? The import has been running for close to 1 1/2 days and has persisted only 10 million rows so far. How can I batch this so that it becomes faster, and is there a way to run it in parallel?

I am using the code from the question linked below to read the CSV; I would like to know if there is a better way to achieve this.

Refer: dealing with large CSV files (20G) in ruby

Brisi
  • What's the objective of having 20 mil rows in memory? Why not just read each line and chuck it into a SQLite database? (Then do the analysis there.) If you're trying to apply functions to a large amount of data, then you'll want to look at doing MapReduce. – Besto Dec 01 '14 at 07:34
  • I want to process all the data and store it in a MySQL database for further processing. Let me know if I can optimize the way I read the data or batch-persist it; I'd appreciate any code snippets here. Also, did the rows go through? – Brisi Dec 01 '14 at 10:17
  • Can you provide an example schema? I'll put together a snippet then. – Besto Dec 02 '14 at 07:04
  • create_table :test do |t| t.integer :foo, :null => false, :limit => 11; t.integer :foo1, :null => false, :limit => 11; t.integer :foo2, :null => false, :limit => 11 end – Brisi Dec 03 '14 at 11:43
  • @Besto Can you provide the code snippet? – Brisi Dec 04 '14 at 12:13
  • I would use another tool for parsing the CSV into the database... like Python. – Miknash Dec 04 '14 at 12:37
  • @NickCatib, why? What does it change? – blelump Dec 04 '14 at 16:10
  • It works way faster than scripting languages for parsing data, and it can be a) integrated into Rails as a bash script which can be run via a task (only if you need to parse this every now and then, but 10 GB every week?), or b) executed once and forgotten about. – Miknash Dec 04 '14 at 16:16
  • Twenty million rows is a lot. I would consider writing SQL insert statements in batches of, say, 1,000 rows versus pushing 1,000 rows through Rails processing, and see if that gets you a benefit. Also, sometimes this data needs to be inserted serially and sometimes not. Try both. – timpone Dec 06 '14 at 04:48

3 Answers

4

You can first split the file into several smaller files; then you will be able to process several files in parallel.

For splitting the file, it will probably be faster to use a tool like split:

split -l 1000000 ./test.txt ./out-files-
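
As a rough sketch of the parallel step, you could fork one worker process per chunk produced by split; import_chunk here is a hypothetical placeholder for whatever per-file import logic you already have:

chunk_files = Dir.glob('./out-files-*')

chunk_files.each do |file|
  fork do
    # Each forked child needs its own database connection.
    ActiveRecord::Base.establish_connection
    import_chunk(file)  # hypothetical per-file import method
  end
end

Process.waitall  # wait for every worker to finish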

Then, while you are processing each of the files, instead of inserting the records one by one, you can combine them into batches and do bulk inserts. Something like:

INSERT INTO some_table 
VALUES 
(1,'data1'), 
(2, 'data2')

For better performance you'll need to build the SQL statement yourself and execute it:

ActiveRecord::Base.connection.execute('INSERT INTO <whatever you have built>')
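
For illustration, a minimal sketch of that batching in Ruby, assuming the tests table with foo, foo1 and foo2 columns from the schema given in the comments; values are escaped with the connection's quote method:

require 'csv'

BATCH_SIZE = 1_000
conn = ActiveRecord::Base.connection

# 'out-files-aa' is one of the chunks produced by split above.
CSV.foreach('./out-files-aa').each_slice(BATCH_SIZE) do |rows|
  values = rows.map do |(foo, foo1, foo2)|
    "(#{conn.quote(foo)}, #{conn.quote(foo1)}, #{conn.quote(foo2)})"
  end
  conn.execute("INSERT INTO tests (foo, foo1, foo2) VALUES #{values.join(', ')}")
end
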
Kamen
2

Since you would like to persist your data to MySQL for further processing, using LOAD DATA INFILE from MySQL would be faster. Something like the following with your schema:

sql = "LOAD DATA LOCAL INFILE 'big_data.csv' INTO TABLE tests FIELDS TERMINATED BY ',' ENCLOSED BY '\"' LINES TERMINATED BY '\n' (foo,foo1)"

con = ActiveRecord::Base.connection

con.execute(sql)

Graham
  • add local_infile: true to database.yml, as per http://stackoverflow.com/questions/21256641/enabling-local-infile-for-loading-data-into-remote-mysql-from-rails – Graham Dec 09 '14 at 17:14
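
For what it's worth, a minimal sketch of the same switch when connecting with the mysql2 gem directly rather than through database.yml; the host, credentials and database name below are placeholders:

require 'mysql2'

client = Mysql2::Client.new(
  host:         'localhost',
  username:     'root',
  database:     'myapp_development',
  local_infile: true   # required for LOAD DATA LOCAL INFILE to be accepted
)

client.query(sql)      # sql being the LOAD DATA statement built above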
1

Key points:

  1. If you use the MySQL InnoDB engine, my advice is to always define an auto-increment PRIMARY KEY: InnoDB uses a clustered index to store data in the table, and a clustered index determines the physical order of data in a table.
    Refer: http://www.ovaistariq.net/521/understanding-innodb-clustered-indexes/
  2. Configure your MySQL server parameters; the most important ones are:
    (1) disable the MySQL binlog
    (2) innodb_buffer_pool_size
    (3) innodb_flush_log_at_trx_commit
    (4) bulk_insert_buffer_size
    You can read this: http://www.percona.com/blog/2013/09/20/innodb-performance-optimization-basics-updated/
  3. You should use a producer-consumer scenario (see the sketch after this list).
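
A minimal producer-consumer sketch in Ruby, assuming the same tests table as in the other answers: one thread reads the CSV and pushes batches onto a bounded queue, while several consumer threads turn each batch into a single multi-row INSERT. Batch size, worker count and column names are illustrative.

require 'csv'
require 'thread'

BATCH_SIZE = 1_000
WORKERS    = 4
queue      = SizedQueue.new(10)  # bounded, so the reader cannot outrun the writers

# Producer: read the CSV and enqueue batches of rows.
producer = Thread.new do
  CSV.foreach('big_data.csv').each_slice(BATCH_SIZE) { |batch| queue << batch }
  WORKERS.times { queue << :done }  # poison pills so the consumers can stop
end

# Consumers: pop batches and write each one as a bulk INSERT.
consumers = WORKERS.times.map do
  Thread.new do
    conn = ActiveRecord::Base.connection_pool.checkout
    while (batch = queue.pop) != :done
      values = batch.map { |(a, b, c)| "(#{conn.quote(a)}, #{conn.quote(b)}, #{conn.quote(c)})" }
      conn.execute("INSERT INTO tests (foo, foo1, foo2) VALUES #{values.join(', ')}")
    end
    ActiveRecord::Base.connection_pool.checkin(conn)
  end
end

([producer] + consumers).each(&:join)

With MRI the threads mostly overlap database and disk waits rather than running Ruby code truly in parallel; for CPU-bound parsing, the process-per-chunk approach from the first answer parallelizes better.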

Sorry for my poor English.

sulpha