Edit (I adjusted the title): I am currently using CSV.foreach, but that starts at the first row. I'd like to start reading a file at an arbitrary line without loading the file into memory. CSV.foreach works well for retrieving data at the beginning of a file but not for data I need towards the end of a file.
This answer is similar to what I am looking to do, but it loads the entire file into memory, which is what I don't want to do.
I have a 10 GB file and the key column is sorted in ascending order:
# example 10gb file rows
key,state,name
1,NY,Jessica
1,NY,Frank
1,NY,Matt
2,NM,Jesse
2,NM,Saul
2,NM,Walt
etc..
I find the line I want to start with this way ...
file = File.expand_path('~/path/10gb_file.csv')
File.open(file, 'rb').each do |line|
  if line[/^2,/]
    puts "#{$.}: #{line}" # 5: 2,NM,Jesse
    row_number = $. # 5
    break
  end
end
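One thing worth noting about the scan above: a file can be seeked to a byte offset but not to a line number, so `$.` alone doesn't let a later read skip ahead. A minimal sketch, assuming the same file layout as the example rows, that records the byte offset of the first matching line instead. `find_offset` is a hypothetical helper name for illustration, not part of any library:

```ruby
# Hypothetical helper: returns the byte offset of the first line whose
# key column equals `key`, or nil if no such line exists. The file is
# read one line at a time, so it is never loaded into memory as a whole.
def find_offset(path, key)
  File.open(path, 'rb') do |io|
    until io.eof?
      offset = io.pos            # byte position where the next line starts
      line = io.gets
      return offset if line.start_with?("#{key},")
    end
  end
  nil
end
```

This still scans from the top once, but the returned offset can be stored and reused with `IO#seek` on later reads.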
... and I'd like to take row_number and do something like this, but without loading the 10 GB file into memory:
CSV.foreach(file, headers: true).drop(row_number) { |row| "..load data..." }
Lastly, I'm currently handling it like the next snippet. It works fine when the rows are towards the front of the file but not when they're near the end.
CSV.foreach(file, headers: true) do |row|
  next if row['key'].to_i < row_number.to_i
  break if row['key'].to_i > row_number.to_i
  "..load data..."
end
I am trying to use CSV.foreach but I'm open to suggestions. An alternative approach I am considering, though it does not seem efficient for numbers towards the middle of a file:
- Use IO or File and read the file line by line
- Get the header row and build the hash manually
- Read the file from the bottom for numbers near the max key value
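A rough sketch of the first two bullets, assuming the byte offset of the first matching line is already known (for example, captured with `io.pos` while scanning). `each_row_from` is a hypothetical helper name, and the sketch assumes UTF-8 data with no quoted fields containing embedded newlines:

```ruby
require 'csv'

# Hypothetical helper: yields each row whose key equals `key` as a Hash,
# starting at byte `offset` (assumed to point at the start of a line).
# Assumes UTF-8 data and no quoted field with an embedded newline.
def each_row_from(path, offset, key)
  File.open(path, 'rb:UTF-8') do |io|
    headers = CSV.parse_line(io.gets)   # header row is the first line of the file
    io.seek(offset)                     # jump straight to the known byte offset
    io.each_line do |line|
      fields = CSV.parse_line(line)
      next if fields.nil? || fields.empty?   # skip blank lines
      row = headers.zip(fields).to_h         # build the hash manually
      break if row['key'] != key.to_s        # keys are sorted, so stop at the first mismatch
      yield row
    end
  end
end
```

Usage would look something like `each_row_from(file, offset, 2) { |row| "..load data..." }`. Only the header line and the rows for the requested key are read, so the cost no longer depends on how far into the file the key sits.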