
Edit (I adjusted the title): I am currently using CSV.foreach but that starts at the first row. I'd like to start reading a file at an arbitrary line without loading the file into memory. CSV.foreach works well for retrieving data at the beginning of a file but not for data I need towards the end of a file.

This answer is similar to what I am trying to do, but it loads the entire file into memory, which is what I want to avoid.

I have a 10gb file and the key column is sorted in ascending order:

# example 10gb file rows
key,state,name
1,NY,Jessica
1,NY,Frank
1,NY,Matt
2,NM,Jesse
2,NM,Saul
2,NM,Walt
etc..

This is how I find the line I want to start at ...

file = File.expand_path('~/path/10gb_file.csv')

row_number = nil

File.open(file, 'rb') do |f|
  f.each_line do |line|
    if line[/^2,/]
      puts "#{$.}: #{line}" # 5: 2,NM,Jesse
      row_number = $.       # 5
      break
    end
  end
end

... and I'd like to take row_number and do something like this, without loading the 10gb file into memory:

CSV.foreach(file, headers: true).drop(row_number) { |row| "..load data..." }

Lastly, I'm currently handling it with the next snippet; it works fine when the rows are towards the front of the file, but not when they're near the end.

CSV.foreach(file, headers: true) do |row|
  next if row['key'].to_i < row_number.to_i
  break if row['key'].to_i > row_number.to_i

  "..load data..."
end

I am trying to use CSV.foreach, but I'm open to suggestions. An alternative approach I am considering (a rough sketch follows the list below), though it does not seem efficient for keys towards the middle of the file:

  • Use IO or File and read the file line by line
  • Get the header row and build the hash manually
  • Read the file from the bottom for numbers near the max key value
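
A rough, untested sketch of that manual approach, using the example path and the `2` key from above; the hash-building is just the header row zipped with each line's fields:

require 'csv'

file = File.expand_path('~/path/10gb_file.csv')

File.open(file, 'rb') do |f|
  # Grab the header row once so we can build each row's hash manually.
  headers = CSV.parse_line(f.gets)

  f.each_line do |line|
    fields = CSV.parse_line(line)
    key    = fields.first.to_i

    next  if key < 2   # skip rows before the key we want
    break if key > 2   # stop once we've passed it

    row = headers.zip(fields).to_h   # e.g. {"key"=>"2", "state"=>"NM", "name"=>"Jesse"}
    # ..load data..
  end
end
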
dwyd
  • There are ways of accessing a file at a given offset. See the documentation of `IO.read`, `IO.seek` et al. (`File` is an `IO`). – Raffael May 26 '16 at 20:05
  • 1
    Consider storing your data in a database. Those things are crazy good at accessing data in various ways :) – Raffael May 26 '16 at 20:12
  • @Raffael Thank you for the suggestions, I will look into `IO.read` and `IO.seek`. I was considering storing the data in the database too but was interested in seeing if I could optimize loading a CSV because the data set I am using is replaced pretty frequently. – dwyd May 26 '16 at 20:15
  • dwyd, did you consider using the gem https://rubygems.org/gems/fastercsv ? It claims to be faster, plus it can wrap an IO object. So you could open a file, skip the first 100'000 lines using `scan` or `lines` and then continue reading it as CSV. – Raffael May 26 '16 at 20:39
  • 1
    If the lines are indexed by line number, you could also implement some sort of binary search: Open the file at the middle, then scan for the next line start, read it and check where you are versus what line you wanted to start at. If far off, jump to a different position in the file and try again. Jumps don't have to be binary but can involve some smart guessing. Just watch out for two byte characters. – Raffael May 26 '16 at 20:43
  • 1
    @Raffael FasterCSV has been the default CSV library in Ruby since 1.9 (it's what you get when you do `require "csv"`; you don't need the gem). – Jordan Running May 26 '16 at 22:22
  • My bad, sorry. Good to know, thx. – Raffael May 26 '16 at 23:29
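
A rough, untested sketch of Raffael's binary-search suggestion, assuming the integer key column is sorted ascending, the header sits on the first line, and it's acceptable to land mid-line and skip ahead to the next full line after each seek (the helper name and the usage lines are illustrative, not from any answer below):

# Binary search over byte offsets: jump into the file, discard the
# (possibly partial) line we landed in, read the next full line, and
# compare its key. The header at offset 0 is never examined this way.
def offset_of_first_key(path, target_key)
  File.open(path, 'rb') do |f|
    lo = 0
    hi = f.size

    matches_at_or_after = lambda do |pos|
      f.seek(pos)
      f.gets                      # discard the line we landed in
      line = f.gets               # examine the next full line
      line.nil? || line.split(',', 2).first.to_i >= target_key
    end

    while lo < hi
      mid = (lo + hi) / 2
      if matches_at_or_after.call(mid)
        hi = mid                  # the next full line here matches; search earlier
      else
        lo = mid + 1              # the next full line here doesn't match; search later
      end
    end

    f.seek(lo)
    f.gets                        # skip to the start of the first matching line
    f.pos                         # byte offset of that line (file size if no match)
  end
end

# Hypothetical usage: seek to the offset and hand the rest of the file to CSV.
# offset = offset_of_first_key(file, 2)
# File.open(file) do |f|
#   f.seek(offset)
#   CSV.new(f, headers: %w[key state name]).each { |row| row['name'] }
# end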

2 Answers


I think you have the right idea. Since you've said you're not worried about fields spanning multiple lines, you can seek to a certain line in the file using IO methods and start parsing there. Here's how you might do it:

require "csv"

begin
  file = File.open(FILENAME)

  # Get the headers from the first line
  headers = CSV.parse_line(file.gets)

  # Seek in the file until we find a matching line
  match = "2,"
  while line = file.gets
    break if line.start_with?(match)
  end

  # Rewind the cursor to the beginning of the line (seek works in bytes)
  file.seek(-line.bytesize, IO::SEEK_CUR)

  csv = CSV.new(file, headers: headers)

  # ...do whatever you want...
ensure
  # Don't forget to close the file
  file.close if file
end

The result of the above is that csv will be a CSV object whose first row is the row that starts with 2,.
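
For example (as dwyd suggests in the comments below), you might iterate it like this, where `field_name` stands in for whichever column you need:

csv.each do |row|
  # Each row is a CSV::Row, so fields can be accessed by header name.
  row['field_name']   # 'field_name' is a placeholder column name
  # ..load data..
end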

I benchmarked this with an 8MB (170k rows) CSV file (from Lahman's Baseball Database) and found that it was much, much faster than using CSV.foreach alone. For a record in the middle of the file it was about 110x faster, and for a record toward the end about 66x faster. If you want, you can take a look at the benchmark here: https://gist.github.com/jrunning/229f8c2348fee4ba1d88d0dffa58edb7

Obviously 8MB is nothing like 10GB, so this is still going to take you a long time. But I'm pretty sure it will be quite a bit faster for you while also accomplishing your goal of not reading all of the data into memory at once.

Jordan Running
  • thank you! I will try this out today. I appreciate the help. – dwyd May 27 '16 at 13:47
  • Thanks again! That was what I needed. To add more detail for others that might find this answer, I would suggest placing `csv.each { |line| "process data here" }` under `# ...do whatever you want...` and maybe include a note that fields can be accessed within the each loop like this `line['field_name']`. – dwyd May 27 '16 at 15:40
  • I'm glad it helped. I'd be curious to know what effect it had on your runtime when processing that 10GB file. – Jordan Running May 27 '16 at 17:59
  • 1
    Here is the benchmark on the 10gb file: https://gist.github.com/dwyd/d256510e5ec63f44ddb409f2b28616a2 – dwyd May 31 '16 at 16:17

CSV.foreach will do everything you need. It streams, so it works well with big files.

require 'csv'

CSV.foreach(File.expand_path('~/path/10gb_file.csv')) do |line|
  # Only one line is read into memory at a time.
  line
end

The fastest way to skip data we're not interested in is to use seek and read to advance past a portion of the file.

File.open("/path/10gb_file.csv") do |f| 
  f.seek(107)  # skip 107 bytes eg. one line. (constant time)
  f.read(50)   # read first 50 on second line
end
Marko Tunjic
  • Yes, but I'm looking for a way to start reading at an arbitrary line. E.g. Start at line 100,000 and not line 1. Wouldn't adding `drop` like this `CSV.foreach(file, headers: true).drop(row_number) { |row| "..load data..." }` cause the whole file to be read into memory before the drop method is used? – dwyd May 26 '16 at 19:52
  • Skimming through the file with foreach could actually be ok since this method does not need to *keep* the whole file in memory, just a small part of it at a time. It might take some time, though. – Raffael May 26 '16 at 19:59
  • @Raffael, It works well if the data is towards the beginning of the file but it takes time to get to the data towards the end of the file. If I need the first 100,000 rows it takes about 5 minutes to load into a hashes array. If I need 100,000 rows towards the middle of the file, it could take 30 minutes or more on my local machine because it needs to skip all the unneeded rows first. – dwyd May 26 '16 at 20:12
  • How is this approach in comparison? `CSV.open(file, headers: true).drop(20).each do |row| ...` (Can't try it out myself since I don't have a file that big laying around :)) – Raffael May 26 '16 at 20:19
  • @Raffael, based on [this answer](http://stackoverflow.com/a/22898988/4322755) and what I tested, it loads the file into memory. However, I think I figured out another approach to the problem. :) I'll post it once I confirm it works. Thanks for the help! – dwyd May 26 '16 at 20:26
  • Do you want to start at an arbitrary line or an arbitrary record? The distinction is important since quoted fields can span multiple lines in a CSV file. – Jordan Running May 26 '16 at 20:30
  • @Jordan, I'd say an arbitrary line would suffice. If possible, my plan was to find the first line that matched a string I was looking for; break out of the loop; and then parse with the `CSV` class. I think I figured out an alternate solution but would still be interested in knowing how to start at an arbitrary line. – dwyd May 26 '16 at 20:39