
I need to import a large CSV file, broken down into small chunks that will be imported every X hours.

I made the following rake task

task :import_reviews => :environment do
  require 'csv'
  CSV.foreach('reviews.csv', :headers => true) do |row|
    Review.create(row.to_hash)
  end
end

Using Heroku Scheduler I could let this task run every day, but I want to break it up into several chunks, for example 100 records per day:

That means I need to keep track of the last row imported and start from that row + 1 the next time the rake task runs. How can I implement this?

Thanks in advance!

Laurens

2 Answers


Read the rest of the CSV into an array and, outside the CSV.foreach loop, write it back to the same CSV file, so that the file gets smaller each time. I suppose I don't have to give this in code, but if necessary leave a comment and I'll do so.

If you want to keep the CSV whole, add a field "processed" to the CSV, fill it with a 1 once a row has been read, and filter those rows out on the next run.
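A rough sketch of that second idea (untested, and it assumes the processed column already exists in the file):

require 'csv'

table = CSV.read('reviews.csv', :headers => true)
imported = 0
table.each do |row|
  next if row['processed'] == '1'   # already imported on an earlier run
  break if imported == 100          # this run's batch of 100 is done
  Review.create(row.to_hash.reject { |k, _| k == 'processed' })
  row['processed'] = '1'            # mark it so it is skipped next time
  imported += 1
end

CSV.open('reviews.csv', 'wb') do |csv|  # write the whole file back, flags included
  csv << table.headers
  table.each { |r| csv << r }
end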

EDIT: this isn't tested and could surely be better, but just to show what I mean for the first approach:

require 'csv'

index = 1
csv_out = CSV.open('new.csv', 'wb')  # CSV::Writer is the old 1.8 API; CSV.open is current
CSV.foreach('reviews.csv', :headers => true) do |row|
  csv_out << row.headers if index == 1  # keep the header row so the next run still works
  if index < 101
    Review.create(row.to_hash)  # import the first 100 data rows
  else
    csv_out << row              # carry the rest over to the new, smaller file
  end
  index += 1
end
csv_out.close

Afterward, dump reviews.csv and rename new.csv to reviews.csv.
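In code, that last step might look like this (a sketch; it assumes both files live in the app root):

require 'fileutils'
FileUtils.mv('new.csv', 'reviews.csv')  # replace the original with the shrunken copy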

peter
  • Is it possible to write to the CSV in the foreach loop, so I can set the processed field after the record is created? – Laurens May 18 '12 at 10:42
  • I don't know if you can update a field with the normal CSV gem; that would be good for a new question. I think it is possible with the FasterCSV gem, see http://stackoverflow.com/questions/3561278/parse-a-csv-update-a-field-then-save for a way. – peter May 18 '12 at 14:46
  • Peter, could you please give me an example of how to add a processed field to the CSV after it has been read? I'm having difficulties writing to the CSV in a foreach loop. Thanks in advance. – Laurens May 18 '12 at 21:28

You might want to do something like this for the chunked CSV parsing, then enqueue the jobs that hit the database with Resque and schedule them appropriately, so they run throttled:

https://gist.github.com/3101950
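Not the gist itself, but a minimal sketch of that pattern (the ReviewImportJob class, queue name, and slice size here are made up for illustration):

require 'csv'
require 'resque'

# Hypothetical Resque job: each enqueued job imports one slice of rows.
class ReviewImportJob
  @queue = :review_import

  def self.perform(rows)
    rows.each { |attributes| Review.create(attributes) }
  end
end

# Parse the CSV once, then enqueue one job per 100-row chunk so the
# database writes are spread out across the Resque workers.
rows = CSV.read('reviews.csv', :headers => true).map(&:to_hash)
rows.each_slice(100) { |slice| Resque.enqueue(ReviewImportJob, slice) }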

Tilo