
I'm not sure this question is specific to Ruby; it may be just as relevant to other languages.

I'm wondering whether I should use CSV.parse or CSV.foreach:

  • CSV.parse(filepath) parses the entire file and returns an array of arrays that mirrors the CSV file and is held entirely in memory. I would then process the rows of that array.

  • CSV.foreach(filepath) reads and parses the file row by row, so each row can be processed as it is read.

When it comes to performance, is there any difference? Is one approach preferable?

PS: I know that in Ruby I can pass a block to the parse method, and it will then handle each row separately.
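
For reference, here's a minimal sketch of the two call styles (data.csv is just a placeholder name):

require 'csv'

filepath = "data.csv" # placeholder file name, for illustration only

# CSV.parse slurps the input and returns an array of arrays,
# so every row is held in memory at once.
rows = CSV.parse(File.read(filepath))
rows.each { |row| puts row.inspect }

# CSV.foreach streams the file, yielding one row at a time;
# only the current row needs to live in memory.
CSV.foreach(filepath) do |row|
  puts row.inspect
end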

benams
    Performance difference? That probably depends on how big your CSV files are and how you're working with them. You can answer that question easily enough yourself by benchmarking how you would be using things in your situation. – mu is too short Oct 03 '13 at 00:53
  • Hello @muistooshort, thanks for your reply. I got your answer: I'll simply measure how quick the parsing is and how busy my memory and CPU are during the process. In general, very big files should usually be processed row by row, and if the file is light enough it can be loaded into memory, right? – benams Oct 03 '13 at 13:19
  • Usually I suppose. It depends on what style makes sense for what you're doing. – mu is too short Oct 03 '13 at 16:49
  • http://dalibornasevic.com/posts/68-processing-large-csv-files-with-ruby – amtest Feb 16 '17 at 12:11
  • "[Why is “slurping” a file not a good practice?](https://stackoverflow.com/questions/25189262)" is useful. – the Tin Man Jun 13 '20 at 18:37

1 Answer


Here's my test:

require 'csv'
require 'benchmark'

small_csv_file = "test_data_small_50k.csv"
large_csv_file = "test_data_large_20m.csv"

Benchmark.bmbm do |x|
  x.report("Small: CSV #parse") do
    CSV.parse(File.open(small_csv_file), headers: true) do |row|
      row
    end
  end

  x.report("Small: CSV #foreach") do
    CSV.foreach(small_csv_file, headers: true) do |row|
      row
    end
  end

  x.report("Large: CSV #parse") do
    CSV.parse(File.open(large_csv_file), headers: true) do |row|
      row
    end
  end

  x.report("Large: CSV #foreach") do
    CSV.foreach(large_csv_file, headers: true) do |row|
      row
    end
  end
end

Rehearsal -------------------------------------------------------
Small: CSV #parse     0.950000   0.000000   0.950000 (  0.952493)
Small: CSV #foreach   0.950000   0.000000   0.950000 (  0.953514)
Large: CSV #parse   659.000000   2.120000 661.120000 (661.280070)
Large: CSV #foreach 648.240000   1.800000 650.040000 (650.062963)
------------------------------------------- total: 1313.060000sec

                          user     system      total        real
Small: CSV #parse     1.000000   0.000000   1.000000 (  1.143246)
Small: CSV #foreach   0.990000   0.000000   0.990000 (  0.984285)
Large: CSV #parse   646.380000   1.890000 648.270000 (648.286247)
Large: CSV #foreach 651.010000   1.840000 652.850000 (652.874320)

The benchmarks were run on a MacBook Pro with 8 GB of memory. The results indicate that CPU performance is effectively equivalent whether you use CSV#parse or CSV#foreach.
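
That said, the two approaches differ in memory behavior: parse holds every row at once, while foreach keeps only the current row. A rough way to observe this yourself (a sketch, assuming a Unix-like system where ps is available; the exact numbers will vary with GC timing):

require 'csv'

# Resident set size of this process in MB, read via `ps` (Unix-like systems).
def rss_mb
  `ps -o rss= -p #{Process.pid}`.to_i / 1024.0
end

csv_file = "test_data_large_20m.csv" # same file as in the benchmark above

before = rss_mb
rows = CSV.parse(File.read(csv_file)) # whole file kept as an array of arrays
puts "parse:   grew ~#{(rss_mb - before).round} MB (#{rows.size} rows held)"
rows = nil # let the array be garbage collected before the next measurement

before = rss_mb
count = 0
CSV.foreach(csv_file) { |_row| count += 1 } # one row alive at a time
puts "foreach: grew ~#{(rss_mb - before).round} MB (#{count} rows seen)"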

With the headers option removed (only the small file was tested):

require 'csv'
require 'benchmark'

small_csv_file = "test_data_small_50k.csv"

Benchmark.bmbm do |x|
  x.report("Small: CSV #parse") do
    CSV.parse(File.open(small_csv_file)) do |row|
      row
    end
  end

  x.report("Small: CSV #foreach") do
    CSV.foreach(small_csv_file) do |row|
      row
    end
  end
end

Rehearsal -------------------------------------------------------
Small: CSV #parse     0.590000   0.010000   0.600000 (  0.597775)
Small: CSV #foreach   0.620000   0.000000   0.620000 (  0.621950)
---------------------------------------------- total: 1.220000sec

                          user     system      total        real
Small: CSV #parse     0.590000   0.000000   0.590000 (  0.597594)
Small: CSV #foreach   0.610000   0.000000   0.610000 (  0.604537)

Notes:

  • large_csv_file had a different structure than small_csv_file, so comparing results (e.g. rows/sec) between the two files would be inaccurate.

  • small_csv_file had 50,000 records.

  • large_csv_file had 1,000,000 records.

  • Setting the headers option to true reduces performance significantly because a CSV::Row object is built from the header/field pairs for every row (see HeaderConverters in the CSV docs: http://www.ruby-doc.org/stdlib-2.0.0/libdoc/csv/rdoc/CSV.html); the sketch below illustrates the difference.
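
To make that last note concrete, here is a small sketch of what headers: true changes (the sample data is made up):

require 'csv'

data = "name,age\nalice,30\nbob,25\n" # made-up sample data

# Without headers: each yielded row is a plain Array.
CSV.parse(data) do |row|
  p row # ["name", "age"], then ["alice", "30"], then ["bob", "25"]
end

# With headers: true, the first line becomes the header row and each
# subsequent row is a CSV::Row, indexable by header name; building that
# object per row is the extra cost the benchmark shows.
CSV.parse(data, headers: true) do |row|
  p row["name"] # "alice", then "bob"
end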

Garren S
    I doubt the question is as much about cpu usage and time as it is memory usage. If you are performing an action on each row, then only having the row in memory is conservative. If you are getting an array of arrays, then memory usage should be the same. – Stephen Reid Jul 20 '17 at 18:50