
I want to import a large amount of CSV data (not directly into AR, but after some lookups), and my code is very slow.

def csv_import 
    require 'csv'
    file = File.open("/#{Rails.public_path}/uploads/shate.csv")
    csv = CSV.open(file, "r:ISO-8859-15:UTF-8", {:col_sep => ";", :row_sep => :auto, :headers => :first_row})

    csv.each do |row|
      #ename,esupp= row[1].split(/_/) 
      #(ename,esupp,foo) = row[1]..split('_')
      abrakadabra = row[0].to_s()
      (ename,esupp) = abrakadabra.split(/_/)
      eprice = row[6]
      eqnt = row[1]
      # logger.info("1) ")
      # logger.info(ename)
      # logger.info("---")
      # logger.info(esupp)
      #----
      #ename = row[4]
      #eprice = row[7]
      #eqnt = row[10]
      #esupp = row[12]

      if ename.present? && ename.size > 3
        search_condition = "*" + ename.upcase + "*"     

        if esupp.present?
          #supplier = @suppliers.find{|item| item['SUP_BRAND'] =~ Regexp.new(".*#{esupp}.*") }
          supplier = Supplier.where("SUP_BRAND like ?", "%#{esupp}%").first
          logger.warn("!!! *** supp !!!")
          #logger.warn(supplier)
        end

        if supplier.present?

          @search = ArtLookup.find(:all, :conditions => ['MATCH (ARL_SEARCH_NUMBER) AGAINST(? IN BOOLEAN MODE)', search_condition.gsub(/[^0-9A-Za-z]/, '')])
          @articles = Article.find(:all, :conditions => { :ART_ID => @search.map(&:ARL_ART_ID)})
          @art_concret = @articles.find_all{|item| item.ART_ARTICLE_NR.gsub(/[^0-9A-Za-z]/, '').include?(ename.gsub(/[^0-9A-Za-z]/, '')) }

          @aa = @art_concret.find{|item| item['ART_SUP_ID']==supplier.SUP_ID} #| @articles
          if @aa.present?
            @art = Article.find_by_ART_ID(@aa)
          end

          if @art.present?
            @art.PRICEM = eprice
            @art.QUANTITYM = eqnt
            @art.datetime_of_update = DateTime.now
            @art.save
          end

        end
        logger.warn("------------------------------")       
      end

      #logger.warn(esupp)
    end
 end

Even if I strip it down to just this, it is still slow.

def csv_import 
    require 'csv'
    file = File.open("/#{Rails.public_path}/uploads/shate.csv")
    csv = CSV.open(file, "r:ISO-8859-15:UTF-8", {:col_sep => ";", :row_sep => :auto, :headers => :first_row})

    csv.each do |row|
    end
end

Could anybody help me increase the speed using FasterCSV?

undur_gongor
byCoder
  • This won't affect speed, but you are not closing the `file`. Use `File.readlines("/file")`; then you don't have to worry about leaving a file open. – The Who Sep 10 '12 at 18:29
  • @TheWho I don't understand you, could you give a detailed example? – byCoder Sep 10 '12 at 19:44
  • If you run `File.open` then you need to close the file. You don't want to leak open files. http://stackoverflow.com/questions/4795447/rubys-file-open-and-the-need-for-f-close – The Who Sep 11 '12 at 15:32
  • @TheWho how do I close the file if I use `CSV.foreach("/#{Rails.public_path}/uploads/hshatem2.csv", {:col_sep => ',', :row_sep => :auto, :headers => :first_row}) do |row|`? – byCoder Sep 11 '12 at 21:51
  • If you open a file in a block (such as `File.open(file) { ... }`) then the file will close automatically; see the sketch after these comments. – The Who Sep 12 '12 at 16:29
  • @TheWho but if not? What if I open it in foreach? – byCoder Sep 12 '12 at 17:05
  • Did you try a profiler to find the time-consuming parts? Just add a `require 'profile'`. Perhaps you should try with a smaller test file. Are you sure your problem is CSV? Perhaps it is a DB problem (you are running selects with `like`, and I see no commit...) – knut Sep 12 '12 at 18:32
  • The `like` query is on a small table; the other table, artlookup, is about 3 GB... there are troubles with it too, but first I want to know why the CSV takes so long to open (encoding matters for speed too...) – byCoder Sep 12 '12 at 19:15
  • @knut also, why do new actions show up in the log right after starting the import, but only appear in the console after some time (>1h)? – byCoder Sep 24 '12 at 20:38
  • @TheWho also, why do new actions show up in the log right after starting the import, but only appear in the console after some time (>1h)? – byCoder Sep 24 '12 at 20:38
  • @PavelBY: Output on the console is not visible immediately. You can force an immediate output with `STDOUT.flush`. See also http://stackoverflow.com/questions/6579273/program-running-not-in-order or http://stackoverflow.com/questions/5944887/ruby-puts-not-outputting-in-real-time – knut Sep 24 '12 at 21:12
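
For reference, here is a minimal sketch of the block form mentioned above (and of CSV.foreach, which opens and closes the file by itself), reusing the path and options from the question; the per-row work is just a placeholder:

require 'csv'

# Block form: the file handle is closed automatically when the block returns,
# even if an exception is raised inside it.
File.open("/#{Rails.public_path}/uploads/shate.csv", "r:ISO-8859-15:UTF-8") do |file|
  CSV.new(file, :col_sep => ";", :row_sep => :auto, :headers => :first_row).each do |row|
    # process row here
  end
end

# CSV.foreach needs no explicit close at all:
CSV.foreach("/#{Rails.public_path}/uploads/shate.csv",
            :encoding => 'ISO-8859-15:UTF-8',
            :col_sep => ";", :row_sep => :auto, :headers => :first_row) do |row|
  # process row here
end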

4 Answers


As its name implies, FasterCSV is, well, faster :)

http://fastercsv.rubyforge.org

Also see, for some more info:

Ruby on Rails Moving from CSV to FasterCSV
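
A minimal sketch of reading the question's file with FasterCSV, using the path and separators from the question (on Ruby 1.9 the FasterCSV code base ships as the built-in CSV class, so only the class name differs):

require 'fastercsv'

FasterCSV.foreach("/#{Rails.public_path}/uploads/shate.csv",
                  :col_sep => ";", :row_sep => :auto, :headers => :first_row) do |row|
  # process row here
end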

Doon
  • it's standard Ruby 1.9 FasterCSV!!! Some admin @sawa deleted this important word from my question!!! – byCoder Aug 28 '12 at 20:21
  • @PavelBY You just had that word in parentheses, isolated from other sentences, and it was not clear what it meant. – sawa Aug 28 '12 at 20:30
  • @sawa please do as I did before, but do it as you say => clear and sensible – byCoder Aug 28 '12 at 20:37

I don't think it will get much faster.

That said, some testing shows that a significant part of the time is spent on transcoding (about 15% for my test case). So if you could skip that (e.g. by creating the CSV in UTF-8 already, as sketched below) you would see some improvement.
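
For example, the file could be transcoded once up front so the import itself reads plain UTF-8 (a sketch; the output file name is just an assumption):

# One-time conversion; afterwards the importer can read the file as UTF-8
# without transcoding from ISO-8859-15 on every run.
src = File.read("/#{Rails.public_path}/uploads/shate.csv", :encoding => 'ISO-8859-15')
File.open("/#{Rails.public_path}/uploads/shate_utf8.csv", "w:UTF-8") do |f|
  f.write(src.encode('UTF-8'))
end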

Besides, according to ruby-doc.org, the "primary" interface for reading CSV files is foreach, so this should be preferred:

def csv_import
  require 'csv'
  CSV.foreach("/#{Rails.public_path}/uploads/shate.csv", {:encoding => 'ISO-8859-15:UTF-8', :col_sep => ';', :row_sep => :auto, :headers => :first_row}) do | row |
    # use row here...
  end
end

Update

You could also try splitting the parsing across several threads. I saw some performance increase experimenting with this code (handling of the header row is left out):

N = 10000
def csv_import
  require 'csv'
  all_lines = File.read("/#{Rails.public_path}/uploads/shate.csv").lines
  # parts will contain the parsed CSV data of the different chunks/slices
  # threads will contain the threads
  parts, threads = [], []
  # iterate over chunks/slices of N lines of the CSV file
  all_lines.each_slice(N) do | plines |
    # add an array object for the current chunk to parts
    parts << result = []
    # create a thread for parsing the current chunk, hand it over the chunk 
    # and the current parts sub-array
    threads << Thread.new(plines.join, result) do  | tsrc, tresult |
      # parse the chunk
      parsed = CSV.parse(tsrc, {:encoding => 'ISO-8859-15:UTF-8', :col_sep => ";", :row_sep => :auto})
      # add the parsed data to the parts sub-array
      tresult.replace(parsed.to_a)
    end
  end
  # wait for all threads to finish
  threads.each(&:join)
  # merge all the parts sub-arrays into one big array and iterate over it
  parts.flatten(1).each do | row |
    # use row (Array)
  end
end

This splits the input into chunks of 10000 lines and creates a parsing thread for each chunk. Each thread gets handed a sub-array of the parts array for storing its result. When all threads have finished (after threads.each(&:join)), the results of all chunks in parts are joined together, and that's it.
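
A sketch of how the per-row work from the question would plug into the final loop (row is a plain Array here, so the numeric indices from the question still apply; the lookup code itself is unchanged and will still dominate the run time):

parts.flatten(1).each do |row|
  ename, esupp = row[0].to_s.split(/_/)
  eprice = row[6]
  eqnt   = row[1]
  next unless ename.present? && ename.size > 3
  # ... same Supplier / ArtLookup / Article lookups and @art.save as in the question
end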

undur_gongor
  • hm, could you rewrite your update according to my question? ...you could get +50 – byCoder Sep 11 '12 at 17:46
  • also, how can I convert my CSV to UTF-8 so that Ruby understands it? I tried UTF-8 and all is OK, but when a Russian word appears in my UTF-8 document (it will have many of them) it raises a UTF-8 error... how do I solve it? – byCoder Sep 11 '12 at 22:15
  • Sorry, I don't understand that question. Does Ruby have a problem reading the UTF-8 CSV, or does the CSV have a different encoding? Maybe you should post another Stack Overflow question. – undur_gongor Sep 12 '12 at 07:00
  • also, please adapt your updated code to my code... and explain it – byCoder Sep 12 '12 at 09:05
  • Did that. If Ruby cannot read a correctly UTF-8-encoded CSV (containing Russian) then it's a Ruby bug. I doubt that, though. You should open a new question for that issue. – undur_gongor Sep 12 '12 at 09:30
  • @PavelBY: Are you explicitly setting the input encoding to UTF-8? – Linuxios Sep 15 '12 at 02:21
  • also, why do new actions show up in the log right after starting the import, but only appear in the console after some time (>1h)? – byCoder Sep 24 '12 at 20:37
  • @PavelBY: Added some (too many?) comments. And sorry, I do not understand your last comment/question at all. – undur_gongor Sep 24 '12 at 21:49

I'm curious how big the file is, and how many columns it has.

Using CSV.foreach is the preferred way. It would be interesting to see the memory profile as your app is running. (Sometimes the slowness is due to printing, so make sure you don't do more of that than you need.)

You might be able to preprocess it and exclude any row that doesn't have an esupp, since it looks like your code only cares about those rows. You could also truncate any right-hand columns you don't care about; a sketch follows.
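
A sketch of that kind of preprocessing, assuming (as in the question's code) that the ename_esupp pair sits in the first column and that nothing beyond column 6 is used:

require 'csv'

rows = CSV.read("/#{Rails.public_path}/uploads/shate.csv",
                :encoding => 'ISO-8859-15:UTF-8', :col_sep => ";", :row_sep => :auto)
rows = rows.drop(1)                                      # drop the header row
rows = rows.select { |row| row[0].to_s.include?("_") }   # keep only rows with an esupp part
rows = rows.map { |row| row[0..6] }                      # drop columns the import never reads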

Another technique would be to gather the unique components and put them in a hash. It seems like you are firing the same query multiple times; a sketch of caching the supplier lookup follows.
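
For example, the Supplier lookup from the question could be cached in a hash keyed by esupp, so each distinct value only hits the database once (a sketch):

supplier_cache = {}

csv.each do |row|
  ename, esupp = row[0].to_s.split(/_/)
  next if esupp.blank?
  # query only the first time a given esupp is seen; nil results are cached too
  unless supplier_cache.key?(esupp)
    supplier_cache[esupp] = Supplier.where("SUP_BRAND like ?", "%#{esupp}%").first
  end
  supplier = supplier_cache[esupp]
  # ... rest of the per-row processing
end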

You just need to profile it and see where it's spending its time.

J_McCaffrey
  • I know the query via those lookups takes time, but why does the CSV take so long to open... one possibility is the encoding, but how do I solve that? I also opened a question for this, but nobody helped – byCoder Sep 13 '12 at 22:12
  • well, it's hard to help if we don't know the important details. Strip out sensitive data and throw the important parts in a gist that we could check out. Make it easier to help you, and you'll get some help, I'm sure of it. – J_McCaffrey Sep 14 '12 at 15:37

Check out the gem smarter_csv! It can read CSV files in chunks, and you can then create Resque jobs to further process and insert those chunks into a database.

https://github.com/tilo/smarter_csv
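
A minimal sketch along the lines of the smarter_csv README; the Resque worker class name here is hypothetical:

require 'smarter_csv'

# Read the file in chunks of 100 rows and hand each chunk to a background worker,
# so parsing and the database writes happen outside the request.
SmarterCSV.process("/#{Rails.public_path}/uploads/shate.csv",
                   :chunk_size => 100, :col_sep => ";") do |chunk|
  Resque.enqueue(CsvChunkImporter, chunk)  # CsvChunkImporter is a hypothetical worker class
end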

Tilo