
I want to import a large amount of CSV data (not directly into AR, but after some lookups), and my code is very slow.

def csv_import 
    require 'csv'
    file = File.open("/#{Rails.public_path}/uploads/shate.csv")
    csv = CSV.open(file, "r:ISO-8859-15:UTF-8", {:col_sep => ";", :row_sep => :auto, :headers => :first_row})

    csv.each do |row|
      #ename,esupp= row[1].split(/_/) 
      #(ename,esupp,foo) = row[1]..split('_')
      abrakadabra = row[0].to_s()
      (ename,esupp) = abrakadabra.split(/_/)
      eprice = row[6]
      eqnt = row[1]
      # logger.info("1) ")
      # logger.info(ename)
      # logger.info("---")
      # logger.info(esupp)
      #----
      #ename = row[4]
      #eprice = row[7]
      #eqnt = row[10]
      #esupp = row[12]

      if ename.present? && ename.size > 3
        search_condition = "*" + ename.upcase + "*"     

        if esupp.present?
          #supplier = @suppliers.find{|item| item['SUP_BRAND'] =~ Regexp.new(".*#{esupp}.*") }
          supplier = Supplier.where("SUP_BRAND like ?", "%#{esupp}%").first
          logger.warn("!!! *** supp !!!")
          #logger.warn(supplier)
        end

        if supplier.present?

          @search = ArtLookup.find(:all, :conditions => ['MATCH (ARL_SEARCH_NUMBER) AGAINST(? IN BOOLEAN MODE)', search_condition.gsub(/[^0-9A-Za-z]/, '')])
          @articles = Article.find(:all, :conditions => { :ART_ID => @search.map(&:ARL_ART_ID)})
          @art_concret = @articles.find_all{|item| item.ART_ARTICLE_NR.gsub(/[^0-9A-Za-z]/, '').include?(ename.gsub(/[^0-9A-Za-z]/, '')) }

          @aa = @art_concret.find{|item| item['ART_SUP_ID']==supplier.SUP_ID} #| @articles
          if @aa.present?
            @art = Article.find_by_ART_ID(@aa)
          end

          if @art.present?
            @art.PRICEM = eprice
            @art.QUANTITYM = eqnt
            @art.datetime_of_update = DateTime.now
            @art.save
          end

        end
        logger.warn("------------------------------")       
      end

      #logger.warn(esupp)
    end
 end

Even if I strip it down to just this, it is still slow.

def csv_import 
    require 'csv'
    file = File.open("/#{Rails.public_path}/uploads/shate.csv")
    csv = CSV.open(file, "r:ISO-8859-15:UTF-8", {:col_sep => ";", :row_sep => :auto, :headers => :first_row})

    csv.each do |row|
    end
end

Could anybody help me increase the speed using FasterCSV?

undur_gongor
byCoder
  • This won't affect speed, but you are not closing the `file`. Use `File.readlines("/file")`; then you don't have to worry about leaving a file open. – The Who Sep 10 '12 at 18:29
  • @TheWho I don't understand you, could you give a detailed example? – byCoder Sep 10 '12 at 19:44
  • If you run `File.open` then you need to close the file. You don't want to leak open files. http://stackoverflow.com/questions/4795447/rubys-file-open-and-the-need-for-f-close – The Who Sep 11 '12 at 15:32
  • @TheWho how do I close the file if I use `CSV.foreach("/#{Rails.public_path}/uploads/hshatem2.csv", {:col_sep => ',', :row_sep => :auto, :headers => :first_row}) do |row|`? – byCoder Sep 11 '12 at 21:51
  • If you open a file in a block (such as `File.open(file) { ... }`) then the file will close automatically; see the sketch after these comments. – The Who Sep 12 '12 at 16:29
  • @TheWho but if not? What if I open it in foreach? – byCoder Sep 12 '12 at 17:05
  • Did you try a profiler to find the time-consuming parts? Just add a `require 'profile'`. Perhaps you should try with a smaller test file. Are you sure your problem is CSV? Perhaps it is a DB problem (you are running selects with `like`, and I see no commit...) – knut Sep 12 '12 at 18:32
  • The `like` query is on a small table; the other table, artlookup, is about 3 GB... there are troubles with it too, but first I want to know why the CSV takes so long to open (encoding matters for speed too...) – byCoder Sep 12 '12 at 19:15
  • @knut also, why do new actions show up in the log right after starting the import, but only appear in the console after some time (>1h)? – byCoder Sep 24 '12 at 20:38
  • @TheWho also, why do new actions show up in the log right after starting the import, but only appear in the console after some time (>1h)? – byCoder Sep 24 '12 at 20:38
  • @PavelBY: Output on the console is not visible immediately. You can force an immediate output with `STDOUT.flush`. See also http://stackoverflow.com/questions/6579273/program-running-not-in-order or http://stackoverflow.com/questions/5944887/ruby-puts-not-outputting-in-real-time – knut Sep 24 '12 at 21:12
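
For reference, here is a minimal sketch of the block form mentioned above (and of CSV.foreach, which opens and closes the file by itself), reusing the path and options from the question; the per-row work is just a placeholder:

require 'csv'

# Block form: the file handle is closed automatically when the block returns,
# even if an exception is raised inside it.
File.open("/#{Rails.public_path}/uploads/shate.csv", "r:ISO-8859-15:UTF-8") do |file|
  CSV.new(file, :col_sep => ";", :row_sep => :auto, :headers => :first_row).each do |row|
    # process row here
  end
end

# CSV.foreach needs no explicit close at all:
CSV.foreach("/#{Rails.public_path}/uploads/shate.csv",
            :encoding => 'ISO-8859-15:UTF-8',
            :col_sep => ";", :row_sep => :auto, :headers => :first_row) do |row|
  # process row here
end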

4 Answers


As its name implies, FasterCSV is, well, faster :)

http://fastercsv.rubyforge.org

Also see, for some more info:

Ruby on Rails Moving from CSV to FasterCSV
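
A minimal sketch of reading the question's file with FasterCSV, using the path and separators from the question (on Ruby 1.9 the FasterCSV code base ships as the built-in CSV class, so only the class name differs):

require 'fastercsv'

FasterCSV.foreach("/#{Rails.public_path}/uploads/shate.csv",
                  :col_sep => ";", :row_sep => :auto, :headers => :first_row) do |row|
  # process row here
end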

Doon
  • it's standard Ruby 1.9 FasterCSV!!! Some admin @sawa deleted this important word from my question!!! – byCoder Aug 28 '12 at 20:21
  • @PavelBY You just had that word in parentheses, isolated from other sentences, and it was not clear what it meant. – sawa Aug 28 '12 at 20:30
  • @sawa please do as I did before, but do it as you say => clear and sensible – byCoder Aug 28 '12 at 20:37

I don't think it will get much faster.

That said, some testing shows that a significant part of the time is spent on transcoding (about 15% for my test case). So if you could skip that (e.g. by creating the CSV in UTF-8 already, as sketched below) you would see some improvement.
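
For example, the file could be transcoded once up front so the import itself reads plain UTF-8 (a sketch; the output file name is just an assumption):

# One-time conversion; afterwards the importer can read the file as UTF-8
# without transcoding from ISO-8859-15 on every run.
src = File.read("/#{Rails.public_path}/uploads/shate.csv", :encoding => 'ISO-8859-15')
File.open("/#{Rails.public_path}/uploads/shate_utf8.csv", "w:UTF-8") do |f|
  f.write(src.encode('UTF-8'))
end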

Besides, according to ruby-doc.org, the "primary" interface for reading CSV files is foreach, so this should be preferred:

def csv_import
  require 'csv'
  CSV.foreach("/#{Rails.public_path}/uploads/shate.csv", {:encoding => 'ISO-8859-15:UTF-8', :col_sep => ';', :row_sep => :auto, :headers => :first_row}) do | row |
    # use row here...
  end
end

Update

You could also try splitting the parsing across several threads. I saw some performance increase experimenting with this code (handling of the header row is left out):

N = 10000
def csv_import
  require 'csv'
  all_lines = File.read("/#{Rails.public_path}/uploads/shate.csv").lines
  # parts will contain the parsed CSV data of the different chunks/slices
  # threads will contain the threads
  parts, threads = [], []
  # iterate over chunks/slices of N lines of the CSV file
  all_lines.each_slice(N) do | plines |
    # add an array object for the current chunk to parts
    parts << result = []
    # create a thread for parsing the current chunk, hand it over the chunk 
    # and the current parts sub-array
    threads << Thread.new(plines.join, result) do  | tsrc, tresult |
      # parse the chunk
      parsed = CSV.parse(tsrc, {:encoding => 'ISO-8859-15:UTF-8', :col_sep => ";", :row_sep => :auto})
      # add the parsed data to the parts sub-array
      tresult.replace(parsed.to_a)
    end
  end
  # wait for all threads to finish
  threads.each(&:join)
  # merge all the parts sub-arrays into one big array and iterate over it
  parts.flatten(1).each do | row |
    # use row (Array)
  end
end

This splits the input into chunks of 10000 lines and creates a parsing thread for each chunk. Each thread gets handed a sub-array of the parts array for storing its result. When all threads have finished (after threads.each(&:join)), the results of all chunks in parts are joined together, and that's it.
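
A sketch of how the per-row work from the question would plug into the final loop (row is a plain Array here, so the numeric indices from the question still apply; the lookup code itself is unchanged and will still dominate the run time):

parts.flatten(1).each do |row|
  ename, esupp = row[0].to_s.split(/_/)
  eprice = row[6]
  eqnt   = row[1]
  next unless ename.present? && ename.size > 3
  # ... same Supplier / ArtLookup / Article lookups and @art.save as in the question
end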

undur_gongor
  • hm, could you rewrite your update according to my question? ...you could get +50 – byCoder Sep 11 '12 at 17:46
  • also, how can I convert my CSV to UTF-8 so that Ruby understands it? I tried UTF-8 and all is OK, but when a Russian word appears in my UTF-8 document (it will have many of them) it raises a UTF-8 error... how do I solve it? – byCoder Sep 11 '12 at 22:15
  • Sorry, I don't understand that question. Does Ruby have a problem reading the UTF-8 CSV, or does the CSV have a different encoding? Maybe you should post another Stack Overflow question. – undur_gongor Sep 12 '12 at 07:00
  • also, please adapt your updated code to my code... and explain it – byCoder Sep 12 '12 at 09:05
  • Did that. If Ruby cannot read a correctly UTF-8-encoded CSV (containing Russian) then it's a Ruby bug. I doubt that, though. You should open a new question for that issue. – undur_gongor Sep 12 '12 at 09:30
  • @PavelBY: Are you explicitly setting the input encoding to UTF-8? – Linuxios Sep 15 '12 at 02:21
  • also, why do new actions show up in the log right after starting the import, but only appear in the console after some time (>1h)? – byCoder Sep 24 '12 at 20:37
  • @PavelBY: Added some (too many?) comments. And sorry, I do not understand your last comment/question at all. – undur_gongor Sep 24 '12 at 21:49

I'm curious how big the file is, and how many columns it has.

Using CSV.foreach is the preferred way. It would be interesting to see the memory profile as your app is running. (Sometimes the slowness is due to printing, so make sure you don't do more of that than you need.)

You might be able to preprocess it and exclude any row that doesn't have an esupp, since it looks like your code only cares about those rows. You could also truncate any right-hand columns you don't care about; a sketch follows.
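
A sketch of that kind of preprocessing, assuming (as in the question's code) that the ename_esupp pair sits in the first column and that nothing beyond column 6 is used:

require 'csv'

rows = CSV.read("/#{Rails.public_path}/uploads/shate.csv",
                :encoding => 'ISO-8859-15:UTF-8', :col_sep => ";", :row_sep => :auto)
rows = rows.drop(1)                                      # drop the header row
rows = rows.select { |row| row[0].to_s.include?("_") }   # keep only rows with an esupp part
rows = rows.map { |row| row[0..6] }                      # drop columns the import never reads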

Another technique would be to gather the unique components and put them in a hash. It seems like you are firing the same query multiple times; a sketch of caching the supplier lookup follows.
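
For example, the Supplier lookup from the question could be cached in a hash keyed by esupp, so each distinct value only hits the database once (a sketch):

supplier_cache = {}

csv.each do |row|
  ename, esupp = row[0].to_s.split(/_/)
  next if esupp.blank?
  # query only the first time a given esupp is seen; nil results are cached too
  unless supplier_cache.key?(esupp)
    supplier_cache[esupp] = Supplier.where("SUP_BRAND like ?", "%#{esupp}%").first
  end
  supplier = supplier_cache[esupp]
  # ... rest of the per-row processing
end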

You just need to profile it and see where it's spending its time.

J_McCaffrey
  • I know the query via those lookups takes time, but why does the CSV take so long to open... one possibility is the encoding, but how do I solve that? I also opened a question for this, but nobody helped – byCoder Sep 13 '12 at 22:12
  • well, it's hard to help if we don't know the important details. Strip out sensitive data and throw the important parts in a gist that we could check out. Make it easier to help you, and you'll get some help, I'm sure of it. – J_McCaffrey Sep 14 '12 at 15:37

Check out the gem smarter_csv! It can read CSV files in chunks, and you can then create Resque jobs to further process and insert those chunks into a database.

https://github.com/tilo/smarter_csv
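
A minimal sketch along the lines of the smarter_csv README; the Resque worker class name here is hypothetical:

require 'smarter_csv'

# Read the file in chunks of 100 rows and hand each chunk to a background worker,
# so parsing and the database writes happen outside the request.
SmarterCSV.process("/#{Rails.public_path}/uploads/shate.csv",
                   :chunk_size => 100, :col_sep => ";") do |chunk|
  Resque.enqueue(CsvChunkImporter, chunk)  # CsvChunkImporter is a hypothetical worker class
end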

Tilo