0

I have huge CSV file of 5.5 GB size, it has more than 100 columns in it. I want to import only specific columns from the CSV file. What are the possible ways to do this?

I want to import it to two different tables. Only one field to one table and rest of the fields into another table.

Should i use COPY command in Postgresql or CSV class or SmartCSV kind of gems for this purpose?

Regards, Suresh.

1 Answers1

0

If I had 5Gb of CSV, I'd better import it without Rails! But, you may have a use case that needs Rails...

Since you've said RAILS, I suppose you are talking about a web request and ActiveRecord...

If you don't care about waiting (and hanging one instance of your server process) you can do this:

Before, notice 2 things: 1) use of temp table, in case of errors you don't mess with your dest table - this is optional, of course. 2) use o option to truncate dest table first

CONTROLLER ACTION:

def updateDB
    remote_file = params[:remote_file]   ##<ActionDispatch::Http::UploadedFile>
    truncate = (params[:truncate]=='true') ? true : false
    if remote_file
        result = Model.csv2tempTable(remote_file.original_filename, remote_file.tempfile) if remote_file
        if result[:result]
            Model.updateFromTempTable(truncate)
            flash[:notice] = 'sucess.'
        else
            flash[:error] = 'Errors: ' + result[:errors].join(" ==>")
        end
    else
        flash[:error] = 'Error: no file given.'
    end
    redirect_to somewhere_else_path
end

MODEL METHODS:

# References:
# http://www.kadrmasconcepts.com/blog/2013/12/15/copy-millions-of-rows-to-postgresql-with-rails/
# http://stackoverflow.com/questions/14526489/using-copy-from-in-a-rails-app-on-heroku-with-the-postgresql-backend
# http://www.postgresql.org/docs/9.1/static/sql-copy.html
#
def self.csv2tempTable(uploaded_name, uploaded_file)
    erros = []
    begin
        #read csv file
        file = uploaded_file
        Rails.logger.info "Creating temp table...\n From: #{uploaded_name}\n "
        #init connection
        conn = ActiveRecord::Base.connection
        rc = conn.raw_connection
        # remove columns created_at/updated_at 
        rc.exec "drop   table IF EXISTS #{TEMP_TABLE}; "
        rc.exec "create table #{TEMP_TABLE} (like #{self.table_name}); "
        rc.exec "alter table #{TEMP_TABLE} drop column created_at, drop column updated_at;"
        #copy it!
        rc.exec("COPY #{TEMP_TABLE} FROM STDIN  WITH CSV HEADER")

        while !file.eof?
          # Add row to copy data
          l = file.readline
          if l.encoding.name != 'UTF-8'
              Rails.logger.info "line encoding is #{l.encoding.name}..."
              # ENCODING: 
              # If the source string is already encoded in UTF-8, then just calling .encode('UTF-8') is a no-op,
              # and no checks are run. However, converting it to UTF-16 first forces all the checks for invalid byte
              # sequences to be run, and replacements are done as needed.
              # Reference: http://stackoverflow.com/questions/2982677/ruby-1-9-invalid-byte-sequence-in-utf-8?rq=1
              l = l.encode('UTF-16', 'UTF-8').encode('UTF-8', 'UTF-16')
          end
          Rails.logger.info "writing line with encoding #{l.encoding.name} => #{l[0..80]}"
          rc.put_copy_data(  l  )
        end
        # We are done adding copy data
        rc.put_copy_end
        # Display any error messages
        while res = rc.get_result
          e_message = res.error_message
          if e_message.present?
            erros << "Erro executando SQL: \n" + e_message
          end
        end
    rescue StandardError => e
        erros << "Error in csv2tempTable: \n #{e} => #{e.to_yaml}"
    end
    if erros.present?
        Rails.logger.error erros.join("*******************************\n")
        { result: false, erros: erros }
    else
        { result: true, erros: [] }
    end

end

# copy from TEMP_TABLE into self.table_name
# If <truncate> = true, truncates self.table_name first
# If <truncate> = false, update lines from TEMP_TABLE into self.table_name
#
def self.updateFromTempTable(truncate)
    erros = []
    begin
        Rails.logger.info "Refreshing table #{self.table_name}...\n Truncate: #{truncate}\n "
        #init connection
        conn = ActiveRecord::Base.connection
        rc = conn.raw_connection
        #
        if truncate
            rc.exec "TRUNCATE TABLE #{self.table_name}"
            return false unless check_exec(rc)
            rc.exec "INSERT INTO #{self.table_name} SELECT *, '#{DateTime.now}' as created_at, '#{DateTime.now}' as updated_at FROM #{TEMP_TABLE}"
            return false unless check_exec(rc)
        else
            #remove lines from self.table_name that are present in temp
            rc.exec "DELETE FROM #{self.table_name} WHERE id IN  ( SELECT id FROM #{FARMACIAS_TEMP_TABLE} )"
            return false unless check_exec(rc)
            #copy lines from temp into self  + includes timestamps
            rc.exec "INSERT INTO #{self.table_name} SELECT *, '#{DateTime.now}' as created_at, '#{DateTime.now}' as updated_at FROM #{FARMACIAS_TEMP_TABLE};"
            return false unless check_exec(rc)
        end

    rescue StandardError => e
        Rails.logger.error  "Error in updateFromTempTable: \n #{e} => #{e.to_yaml}"
        return false
    end

    true
end
Fernando Fabreti
  • 4,277
  • 3
  • 32
  • 33