3

How can I persist a large amount of data read from a CSV file (say, 20 million rows)? The import has been running for close to 1 1/2 days and has persisted only 10 million rows so far. How can I batch this so that it becomes faster, and is there a way to run it in parallel?

I am using the code from the question linked below to read the CSV; I would like to know if there is a better way to achieve this.

Refer: dealing with large CSV files (20G) in ruby

Brisi
  • What's the objective of having 20 mil rows in memory? Why not just read each line and chuck it into a SQLite database? (Then do the analysis there.) If you're trying to apply functions to a large amount of data, then you'll want to look at doing MapReduce. – Besto Dec 01 '14 at 07:34
  • I want to process all the data and store it in a MySQL database for further processing. Let me know if I can optimize the way I read the data or batch-persist it; I'd appreciate any code snippets here. Also, did the rows go through? – Brisi Dec 01 '14 at 10:17
  • Can you provide an example schema? I'll put together a snippet then. – Besto Dec 02 '14 at 07:04
  • create_table :test do |t| t.integer :foo, :null => false, :limit => 11; t.integer :foo1, :null => false, :limit => 11; t.integer :foo2, :null => false, :limit => 11 end – Brisi Dec 03 '14 at 11:43
  • @Besto Can you provide the code snippet? – Brisi Dec 04 '14 at 12:13
  • I would use another tool for parsing the CSV into the database... like Python. – Miknash Dec 04 '14 at 12:37
  • @NickCatib, why? What does it change? – blelump Dec 04 '14 at 16:10
  • It works way faster than scripting languages for parsing data, and it can be a) integrated into Rails as a bash script which can be run via a task (only if you need to parse this every now and then, but 10 GB every week?), or b) executed once and forgotten about. – Miknash Dec 04 '14 at 16:16
  • Twenty million rows is a lot. I would consider writing SQL insert statements in batches of, say, 1,000 rows versus pushing 1,000 rows through Rails processing, and see if that gets you a benefit. Also, sometimes this data needs to be inserted serially and sometimes not. Try both. – timpone Dec 06 '14 at 04:48

3 Answers

4

You can first split the file into several smaller files; then you will be able to process several files in parallel.

For splitting the file, it will probably be faster to use a tool like split:

split -l 1000000 ./test.txt ./out-files-
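
As a rough sketch of the parallel step, you could fork one worker process per chunk produced by split; import_chunk here is a hypothetical placeholder for whatever per-file import logic you already have:

chunk_files = Dir.glob('./out-files-*')

chunk_files.each do |file|
  fork do
    # Each forked child needs its own database connection.
    ActiveRecord::Base.establish_connection
    import_chunk(file)  # hypothetical per-file import method
  end
end

Process.waitall  # wait for every worker to finish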

Then, while you are processing each of the files, instead of inserting the records one by one, you can combine them into batches and do bulk inserts. Something like:

INSERT INTO some_table 
VALUES 
(1,'data1'), 
(2, 'data2')

For better performance you'll need to build the SQL statement yourself and execute it:

ActiveRecord::Base.connection.execute('INSERT INTO <whatever you have built>')
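
For illustration, a minimal sketch of that batching in Ruby, assuming the tests table with foo, foo1 and foo2 columns from the schema given in the comments; values are escaped with the connection's quote method:

require 'csv'

BATCH_SIZE = 1_000
conn = ActiveRecord::Base.connection

# 'out-files-aa' is one of the chunks produced by split above.
CSV.foreach('./out-files-aa').each_slice(BATCH_SIZE) do |rows|
  values = rows.map do |(foo, foo1, foo2)|
    "(#{conn.quote(foo)}, #{conn.quote(foo1)}, #{conn.quote(foo2)})"
  end
  conn.execute("INSERT INTO tests (foo, foo1, foo2) VALUES #{values.join(', ')}")
end
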
Kamen
2

Since you would like to persist your data to MySQL for further processing, using LOAD DATA INFILE from MySQL would be faster. Something like the following with your schema:

sql = "LOAD DATA LOCAL INFILE 'big_data.csv' INTO TABLE tests FIELDS TERMINATED BY ',' ENCLOSED BY '\"' LINES TERMINATED BY '\n' (foo,foo1)"

con = ActiveRecord::Base.connection

con.execute(sql)

Graham
  • add local_infile: true to database.yml, as per http://stackoverflow.com/questions/21256641/enabling-local-infile-for-loading-data-into-remote-mysql-from-rails – Graham Dec 09 '14 at 17:14
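
For what it's worth, a minimal sketch of the same switch when connecting with the mysql2 gem directly rather than through database.yml; the host, credentials and database name below are placeholders:

require 'mysql2'

client = Mysql2::Client.new(
  host:         'localhost',
  username:     'root',
  database:     'myapp_development',
  local_infile: true   # required for LOAD DATA LOCAL INFILE to be accepted
)

client.query(sql)      # sql being the LOAD DATA statement built above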
1

Key points:

  1. If you use the MySQL InnoDB engine, my advice is to always define an auto-increment PRIMARY KEY: InnoDB uses a clustered index to store data in the table, and a clustered index determines the physical order of data in a table.
    Refer: http://www.ovaistariq.net/521/understanding-innodb-clustered-indexes/
  2. Configure your MySQL server parameters; the most important ones are:
    (1) disable the MySQL binlog
    (2) innodb_buffer_pool_size
    (3) innodb_flush_log_at_trx_commit
    (4) bulk_insert_buffer_size
    You can read this: http://www.percona.com/blog/2013/09/20/innodb-performance-optimization-basics-updated/
  3. You should use a producer-consumer scenario (see the sketch after this list).
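
A minimal producer-consumer sketch in Ruby, assuming the same tests table as in the other answers: one thread reads the CSV and pushes batches onto a bounded queue, while several consumer threads turn each batch into a single multi-row INSERT. Batch size, worker count and column names are illustrative.

require 'csv'
require 'thread'

BATCH_SIZE = 1_000
WORKERS    = 4
queue      = SizedQueue.new(10)  # bounded, so the reader cannot outrun the writers

# Producer: read the CSV and enqueue batches of rows.
producer = Thread.new do
  CSV.foreach('big_data.csv').each_slice(BATCH_SIZE) { |batch| queue << batch }
  WORKERS.times { queue << :done }  # poison pills so the consumers can stop
end

# Consumers: pop batches and write each one as a bulk INSERT.
consumers = WORKERS.times.map do
  Thread.new do
    conn = ActiveRecord::Base.connection_pool.checkout
    while (batch = queue.pop) != :done
      values = batch.map { |(a, b, c)| "(#{conn.quote(a)}, #{conn.quote(b)}, #{conn.quote(c)})" }
      conn.execute("INSERT INTO tests (foo, foo1, foo2) VALUES #{values.join(', ')}")
    end
    ActiveRecord::Base.connection_pool.checkin(conn)
  end
end

([producer] + consumers).each(&:join)

With MRI the threads mostly overlap database and disk waits rather than running Ruby code truly in parallel; for CPU-bound parsing, the process-per-chunk approach from the first answer parallelizes better.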

Sorry for my poor English.

sulpha