6

I am working on a small problem and would like some advice on how to solve it: given a CSV file with an unknown number of columns and rows, output a list of columns with their values and the number of times each value was repeated, without using any library.

If the file is small this shouldn't be a problem, but when it is a few gigabytes I get NoMemoryError: failed to allocate memory. Is there a way to create a hash and read from disk instead of loading the file into memory? You can do that in Perl with tied hashes.

EDIT: Will IO.foreach load the file into memory? How about File.open(filename).each?

fenec
  • 1
    This is a work assignment? Show what code you've written. – the Tin Man Dec 12 '12 at 22:19
  • 1
    Just wondering... did you not accept an answer because none of the solutions helped? Or was there another reason? This question just turned up in my activity again and I wondered. – marcus erronius Apr 27 '15 at 02:45

3 Answers

21

Read the file one line at a time, discarding each line as you go:

File.open("big.csv") do |csv|
  csv.each_line do |line|
    # naive split: does not handle quoted fields that contain commas
    values = line.chomp.split(",")
    # process the values
  end
end

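With this method only one line is held in memory at a time, so reading the file should not exhaust memory however large it is. Whatever you accumulate while processing (for example a hash of counts) still grows with the number of distinct values.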
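As a rough illustration of what the "process the values" step could look like for the counting task in the question, here is a minimal sketch. The helper name `count_column_values` and the sample data are made up for this example, and the plain `split(",")` assumes the file has no quoted fields containing commas or newlines.

```ruby
require "stringio"

# Hypothetical helper: counts how often each value occurs in each column.
# It accepts any IO-like object and reads it one line at a time.
def count_column_values(io)
  # counts[column_index][value] => number of occurrences
  counts = Hash.new { |hash, index| hash[index] = Hash.new(0) }
  io.each_line do |line|
    # Assumption: simple comma-separated fields, no quoting.
    # The -1 limit keeps trailing empty fields instead of dropping them.
    line.chomp.split(",", -1).each_with_index do |value, index|
      counts[index][value] += 1
    end
  end
  counts
end

# Small in-memory demo; a real file would be passed the same way,
# e.g. File.open("big.csv") { |csv| count_column_values(csv) }
sample = StringIO.new("a,b\na,c\n")
count_column_values(sample).each do |index, values|
  puts "column #{index}:"
  values.each { |value, n| puts "  #{value}: #{n}" }
end
```

Only the current line and the counts are kept in memory here, so the footprint depends on how many distinct values the columns contain rather than on the size of the file.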

marcus erronius
6

Are you reading the whole file at once? Reading it line by line, e.g. with ruby -pe, ruby -ne or $stdin.each, should reduce memory usage, because lines that have already been processed can be garbage collected.

data = {}
$stdin.each do |line|
  # Process line, store results in the data hash.
end

Save it as script.rb and pipe the huge CSV file into this script's standard input:

ruby script.rb < data.csv

If you would rather not read from standard input, a small change is needed:

data = {}
File.foreach("data.csv") do |line|
  # Process line, store results in the data hash.
end
Jan
1

For future reference, in such cases you want to use CSV.foreach (after require 'csv'):

CSV.foreach('big_file.csv', headers: true) do |row|
  # process the row
end

This will read the file line by line from the IO object with minimal memory footprint (should be below 1MB regardless of file size).
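A minimal sketch of how this could be applied to the counting task, assuming the csv library is available and the first line of the file holds the column names. The helper name `count_by_header` and the sample data are invented for illustration.

```ruby
require "csv"
require "tempfile"

# Hypothetical helper: counts value occurrences per column, keyed by header name.
def count_by_header(path)
  # counts[header][value] => number of occurrences
  counts = Hash.new { |hash, header| hash[header] = Hash.new(0) }
  # CSV.foreach yields one CSV::Row at a time when headers: true is given.
  CSV.foreach(path, headers: true) do |row|
    row.each { |header, value| counts[header][value] += 1 }
  end
  counts
end

# Small demo using a temporary file in place of a large CSV.
Tempfile.create(["sample", ".csv"]) do |file|
  file.write("name,city\nann,Oslo\nbob,Oslo\n")
  file.flush
  count_by_header(file.path).each do |header, values|
    puts "#{header}:"
    values.each { |value, n| puts "  #{value}: #{n}" }
  end
end
```

Unlike a plain split on commas, the CSV parser copes with quoted fields, though it does go against the "without using any library" constraint in the question.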

Krzysztof Karski