
I'm importing a huge CSV file, and I want to split the work up so that the import runs faster (I don't import directly into the DB; I do some calculations on each row first). The code looks like this:

def import_shatem
  require 'csv'

  CSV.foreach("/#{Rails.public_path}/uploads/hshatem2.csv", { :encoding => 'ISO-8859-15:UTF-8', :col_sep => ';', :row_sep => :auto, :headers => :first_row }) do |row|
    @eur_cur = Currency.find_by_currency_name("EUR")
    abrakadabra = row[0].to_s()
    (ename, esupp) = abrakadabra.split(/_/)
    eprice = row[6].to_f / @eur_cur.currency_value
    eqnt = /(\d+)/.match(row[1])[0].to_f

    if ename.present? && ename.size > 3
      search_condition = "*" + ename.upcase + "*"

      if esupp.present?
        #supplier = @suppliers.find { |item| item['SUP_BRAND'] =~ Regexp.new(".*#{esupp}.*") }
        supplier = Supplier.where("SUP_BRAND like ?", "%#{esupp}%").first
        logger.warn("!!! *** supp !!!")
      end

      if supplier.present?
        @search = ArtLookup.find(:all, :conditions => ['MATCH (ARL_SEARCH_NUMBER) AGAINST(? IN BOOLEAN MODE) and ARL_KIND = 1', search_condition.gsub(/[^0-9A-Za-z]/, '')])
        @articles = Article.find(:all, :conditions => { :ART_ID => @search.map(&:ARL_ART_ID) })
        #@art_concret = @articles.find_all { |item| item.ART_ARTICLE_NR.gsub(/[^0-9A-Za-z]/, '').include?(ename.gsub(/[^0-9A-Za-z]/, '')) }

        @aa = @articles.find { |item| item['ART_SUP_ID'] == supplier.SUP_ID } #| @articles
        if @aa.present?
          @art = Article.find_by_ART_ID(@aa)
        end

        if @art.present?
          #require 'time_diff'
          #cur_time = Time.now.strftime('%Y-%m-%d %H:%M')
          #time_diff_components = Time.diff(@art.datetime_of_update, Time.parse(cur_time))
          limit_time = Time.now + 3.hours
          if (@art.PRICEM.to_f >= eprice.to_f || @art.PRICEM.blank?) #&& @art.datetime_of_update >= limit_time)
            @art.PRICEM = eprice
            @art.QUANTITYM = eqnt
            @art.datetime_of_update = DateTime.now
            @art.save
          end
        end
      end
    end
  end
end

How could I parallelize this? And how can I make the import faster?

byCoder
  • When I had to do something similar (millions of rows) I just split up the CSV into several files (with the Unix `split` command) and started several importers in parallel... – Michael Kohl Sep 12 '12 at 09:46
  • Your comment should be an answer to this question. I did exactly the same thing when I encountered the same issue. – DNNX Sep 12 '12 at 10:02
  • possible duplicate of [Speed up csv import](http://stackoverflow.com/questions/12166389/speed-up-csv-import) – undur_gongor Sep 12 '12 at 10:29
  • You're assigning to a lot of instance variables (@search, @art, ...) in your loop. Do they need to be instance variables? There's a lot of optimization which can be done before trying out parallelism. – Schwern Nov 16 '21 at 02:14
  • What database are you using, please? – Schwern Nov 16 '21 at 02:31
  • `find(:all)` suggests this Rails 2 code. Which version of Rails are you using? – Schwern Nov 16 '21 at 02:46

2 Answers


Check out the gem smarter_csv! It can read CSV files in chunks, and you can then create Sidekiq jobs to process and insert those chunks into a database; see the sketch below.

https://github.com/tilo/smarter_csv
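For illustration, a minimal sketch of that pattern; the worker class name, path, and chunk size are placeholders, not part of smarter_csv itself:

  require 'smarter_csv'

  # Hypothetical Sidekiq worker that processes one chunk of rows.
  class ImportChunkWorker
    include Sidekiq::Worker

    # chunk arrives as an array of row hashes (hash keys become strings
    # after Sidekiq's JSON round-trip).
    def perform(chunk)
      chunk.each do |row|
        # ... per-row calculation and database insert ...
      end
    end
  end

  # Read the CSV in chunks of 100 rows and enqueue one job per chunk.
  SmarterCSV.process('/path/to/file.csv', chunk_size: 100, col_sep: ';') do |chunk|
    ImportChunkWorker.perform_async(chunk)
  end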

Tilo

Looking at the code, the bottleneck will be the database queries. Running it in parallel will not solve this. Instead, let's see if we can make this more efficient.

The big problem is likely the article search. It's doing multiple queries and searching in memory. We'll get to that last.


Currency.find_by_currency_name always returns the same record, so extract it from the loop. It's not likely to be a bottleneck, but it helps. And, assuming currency_name is a column on Currency, we can save a little time by fetching a single value with pick instead of loading the whole record.

  def currency_value
    @currency_value ||= Currency.where(currency_name: "EUR").pick(:currency_value)
  end

Similarly, the Supplier.where lookup could benefit from caching if the CSV contains many repeated values. Cache the return value with Memoist.

  extend Memoist

  private def find_supplier_for_esupp(esupp)
    return if esupp.blank?
    Supplier.where("SUP_BRAND like ?", "%#{esupp}%").first
  end
  memoize :find_supplier_for_esupp

A LIKE '%term%' query cannot use a normal B-tree index, so this search might be slow depending on how large the Supplier table is. If you're using PostgreSQL, you can speed it up with a trigram index.

add_index :suppliers, :SUP_BRAND, using: 'gin', opclass: :gin_trgm_ops
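The trigram index depends on the pg_trgm extension being enabled; a minimal migration sketch, assuming PostgreSQL and Rails 5.2+ (the migration class name is made up):

  class AddTrigramIndexToSuppliers < ActiveRecord::Migration[6.0]
    def change
      # pg_trgm provides the gin_trgm_ops operator class used below.
      enable_extension 'pg_trgm'
      add_index :suppliers, :SUP_BRAND, using: 'gin', opclass: :gin_trgm_ops
    end
  end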

Finally, the article search is probably the biggest bottleneck. It queries ArtLookup and loads every matching record only to use a single column, then loads all the matching Articles into memory, filters them in memory, and queries Article one more time.

Assuming the relationship between Article and ArtLookup is set up properly in the models, this can be cut down to one query.
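For reference, here is a sketch of what that association might look like given the question's legacy column names; the key names are assumptions, so adjust them to your schema:

  class Article < ApplicationRecord
    has_many :art_lookups, class_name: 'ArtLookup',
                           foreign_key: 'ARL_ART_ID',
                           primary_key: 'ART_ID'
  end

With that in place, the lookup becomes a single joined query: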

  art = Article
    .joins(:art_lookups)
    .merge(
      ArtLookup
        .where(ARL_KIND: 1)
        .where(
          'MATCH (ARL_SEARCH_NUMBER) AGAINST(? IN BOOLEAN MODE)',
          search_condition
        )
    )
    .where(
      ART_SUP_ID: supplier.SUP_ID
    )
    .first

That should be significantly faster.


Putting it all together, with some other improvements such as early next guards to avoid all those nested ifs:

require 'csv'
require 'memoist'

class ShatemImporter
  extend Memoist

  # Cache the possibly expensive query to find suppliers.
  private def find_supplier_for_esupp(esupp)
    Supplier.where("SUP_BRAND like ?", "%#{esupp}%").first
  end
  memoize :find_supplier_for_esupp

  # Cache the currency value query outside the loop.
  private def currency_value
    @currency_value ||= Currency.find_by(currency_name: "EUR").currency_value
  end

  def import_shatem(csv_file)
    CSV.foreach(
      csv_file,
      encoding: 'ISO-8859-15:UTF-8', col_sep: ';', row_sep: :auto, headers: :first_row
    ) do |row|
      ename, esupp = row[0].to_s.split('_')
      eprice = row[6].to_f / currency_value
      eqnt = row[1].to_s[/\d+/].to_f

      next if ename.blank? || ename.size < 4
      next if esupp.blank?

      supplier = find_supplier_for_esupp(esupp)
      next unless supplier

      article = Article
        .joins(:art_lookups)
        .merge(
          ArtLookup
            .where(ARL_KIND: 1)
            .where(
              'MATCH (ARL_SEARCH_NUMBER) AGAINST(? IN BOOLEAN MODE)',
              "*#{ename.upcase}*"
            )
        )
        .where(
          ART_SUP_ID: supplier.SUP_ID
        )
        .first
      next unless article

      if article.PRICEM.blank? || article.PRICEM.to_f >= eprice.to_f
        article.update!(
          PRICEM: eprice,
          QUANTITYM: eqnt,
          datetime_of_update: DateTime.now
        )
      end
    end
  end
end
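It would be invoked something like this (the path is just an example):

  ShatemImporter.new.import_shatem("#{Rails.public_path}/uploads/hshatem2.csv")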

This is written against Rails 6, your code looks like Rails 2, and it is not tested. But hopefully it gives you some avenues for optimization.

Schwern