
The following is the code I have used. I am not able to delete rows from Main.csv when the value of the "name" column in Main.csv equals the value of the "name" column in Sub.csv. Please help me with this. I know I am missing something. Thanks in advance.

require 'rubygems'
require 'smarter_csv'

# Read Main.csv in chunks of 100 rows; for each chunk, re-read Sub.csv
# and try to drop matching entries.
main_csv = SmarterCSV.process('Main.csv', :chunk_size => 100) do |chunk|
  short_csv = SmarterCSV.process('Sub.csv', :chunk_size => 100) do |smaller_chunk|
    chunk.each do |each_ch|
      smaller_chunk.each do |small_each_ch|
        # This only removes key/value pairs from the row hash whose value
        # matches :name in Sub.csv -- it never deletes the row itself,
        # and nothing is written back to Main.csv.
        each_ch.delete_if { |k, v| v == small_each_ch[:name] }
      end
    end
  end
end

Hayz
  • SmarterCSV does not support updating the CSV file. – infused Jan 09 '17 at 05:19
  • So, is there any other way that I can delete the rows of huge CSV files without my memory getting affected? – Hayz Jan 09 '17 at 07:19
  • Do you have an example, say with 2 CSVs of 10 lines each, and examples of lines you'd like to keep, and lines you'd like to remove? – Eric Duminil Jan 09 '17 at 08:50
  • How many rows does "Sub.csv" contain? – Stefan Jan 09 '17 at 08:57
  • I presume "without my memory getting affected" means that your PC does not have enough RAM to load both files into memory. You could consider importing them into SQLite (or MariaDB or some similar DBMS) and writing a simple query to find the duplicates. Alternatively, if the files are static, you could just load the keys into memory as a hash, store the line number as the key value, and make a second pass to create a new CSV file with only the lines you want. – JLB Jan 09 '17 at 14:51
  • Also, unless you are reading a file one byte at a time, "chunking" a few (hundred) records at a time may be unnecessary with today's cached file systems. – JLB Jan 09 '17 at 14:54
  • @Stefan - Sub.csv has 2,000 rows, whereas Main.csv has around 1 million rows. – Hayz Jan 10 '17 at 03:23
  • 1
    @sam then simply traverse `Sub.csv` and store each `:name` field in an array. Afterwards, traverse `Main.csv` and output each row whose `:name` field is not contained in the previously created array. – Stefan Jan 10 '17 at 08:01
  • Thanks Stefan, Eric and JLB. But I guess I will use JLB's approach, load the data into a database and work from there. – Hayz Jan 13 '17 at 04:37

1 Answer


It's a bit of a non-standard scenario for smarter_csv.

Sub.csv has 2,000 rows, whereas Main.csv has around 1 million rows.

If all you need to decide is whether the name appears in both files, then you can do this:

1) read the Sub.csv file first, and just store the values of name in an array sub_names

2) open an output file for the result.csv file

3) read the Main.csv file, processing it in chunks, and write each row to the result.csv file if its name does not appear in the array sub_names

4) close the output file – et voilà! (a sketch of these steps follows)
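A minimal sketch of that approach, under a few assumptions not in the original post: the output file name result.csv, the chunk sizes, and the use of a Set (instead of an array) for the lookups are my choices; both files are assumed to have a "name" header, which smarter_csv exposes as the symbol :name by default.

require 'set'
require 'csv'
require 'smarter_csv'

# 1) Sub.csv is small (~2,000 rows), so read it fully and keep only
#    the :name values in a Set for fast lookups
sub_names = Set.new
SmarterCSV.process('Sub.csv', :chunk_size => 500) do |chunk|
  chunk.each { |row| sub_names << row[:name] }
end

# 2) + 3) open result.csv, stream Main.csv in chunks, and copy every
#    row whose :name does not appear in Sub.csv
header_written = false
CSV.open('result.csv', 'w') do |out|
  SmarterCSV.process('Main.csv', :chunk_size => 1000) do |chunk|
    chunk.each do |row|
      next if sub_names.include?(row[:name])
      unless header_written          # write the header once, from the first kept row
        out << row.keys
        header_written = true
      end
      out << row.values              # assumes every row has the same columns
    end
  end
end
# 4) CSV.open closes the output file when its block ends

Only the 2,000 names from Sub.csv are held in memory; Main.csv is streamed chunk by chunk, so the 1-million-row file never has to fit in RAM at once. Note that smarter_csv downcases headers and turns them into symbols by default, which is why the column is referenced as :name.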

Tilo