
I tried searching for this, but couldn't find much. It seems like something that's probably been asked before (many times?), so I apologize if that's the case.

I was wondering what the fastest way to parse certain parts of a file in Ruby would be. For example, suppose I know the information I want for a particular function is between lines 500 and 600 of, say, a 1000-line file. (Obviously this kind of question is geared toward much larger files; I'm just using those smaller numbers for the sake of example.) Since I know it won't be in the first half, is there a quick way of disregarding that information?

Currently I'm using something along the lines of:

while buffer = file_in.gets and file_in.lineno < 600
  next unless file_in.lineno > 500
  # chomp rather than chomp! -- chomp! returns nil when there is nothing
  # to remove, which would make include? blow up
  if buffer.chomp.include? some_string
    do_func_whatever
  end
end

It works, but I just can't help but think it could work better.

I'm very new to Ruby and am interested in learning new ways of doing things in it.

Andrew Grimm
DRobinson

4 Answers

file.lines.drop(500).take(100) # will get you lines 501-600

Generally, you can't avoid reading the file from the start up to the line you are interested in, since each line can be a different length. The one thing you can avoid, though, is loading the whole file into a big array. Just read line by line, counting, and discard lines until you reach the ones you're looking for. Pretty much like your own example; you can just make it more Rubyish.
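For instance, here's a minimal sketch of that read-count-discard approach (the filename is a placeholder; some_string and do_func_whatever are from your example):

File.open("input.txt") do |f|                 # placeholder path
  f.each_line.with_index(1) do |line, lineno|
    next if lineno <= 500                     # discard lines 1-500
    break if lineno > 600                     # stop once we're past line 600
    do_func_whatever if line.include?(some_string)
  end
end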

PS. the Tin Man's comment made me do some experimenting. While I didn't find any reason why drop would load the whole file, there is indeed a problem: drop returns the rest of the file in an array. Here's a way this could be avoided:

file.lines.select.with_index{|l,i| (500..599) === i} # with_index is 0-based, so this selects lines 501-600

PS2: Doh, the above code, while not making a huge array, iterates through the whole file, even the lines past 600. :( Here's a third version:

enum = file.lines
500.times{enum.next} # skip 500
enum.take(100) # take the next 100

or, if you prefer FP:

file.lines.tap{|enum| 500.times{enum.next}}.take(100)

Anyway, the good point of this monologue is that you can learn multiple ways to iterate over a file. ;)

Mladen Jablanović
  • That does look more Rubyish! I had only really thought about that after I posted the question - the fact that "lines" aren't really set by anything other than the space between 'new line' characters (or rather before and after), which would mean that they all have to be parsed for that character anyway. I guess if I had a general idea of the space preceding the required lines, in bits/bytes/whatever, I could jump that area and then start parsing line by line, but for the time being I'll accept that it works pretty well as is. Or as it will be with a nicer-looking line like your own! Thank you. – DRobinson Feb 19 '11 at 19:31
  • Actually, you _could_ make use of `seek` if the lines contained some kind of information related to their position in the file (such as line numbers or sorted timestamps). Then you could pull some variant of binary search (see the sketch after these comments). You can open another question if that would help in your particular case. – Mladen Jablanović Feb 19 '11 at 20:34
  • This does lead to some scalability concerns, though. If the file has several million lines, it's going to be read into memory completely before you can `drop`. That could be slow and make the machine unresponsive as it's loading the data, or fill all available memory if the lines are long, causing paging. For a safer approach with a text file you're better off reading a line at a time, skipping lines until you reach the ones you want, then capturing only the needed lines. – the Tin Man Feb 20 '11 at 00:05
  • @the Tin Man: What makes you think it needs to load whole file in order to `drop`? – Mladen Jablanović Feb 20 '11 at 08:05
  • `drop` is in `Array`, which implied it had to have a silent `to_a` first. I just looked, and Array gets it from `Enumerable`, and the source code shows it loops over its block `n` times, throwing away the result. So it doesn't have to load everything into memory; it does have to load the lines sequentially and throw them away. And, as you say, there are various ways to write it, but the end result is the same: the lines get read just to be counted. And that was my point - reading lines individually skates around a scalability issue, vs. slurping a file, which can kill a host. – the Tin Man Feb 20 '11 at 18:37
  • For Ruby 2.0.0p247 you should use `each_line`: "warning: IO#lines is deprecated; use #each_line instead". – Lucas Renan Oct 09 '13 at 15:22
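Here is a hedged sketch of that binary-search idea, assuming each line begins with a sorted integer key (the helper name and file layout are hypothetical, not from the thread):

# Hypothetical helper: return the first line whose leading integer key
# is >= target, using O(log n) seeks instead of a linear scan.
# Assumes every line starts with a sorted integer key.
def find_first_line_with_key(path, target)
  File.open(path) do |f|
    lo, hi = 0, f.stat.size
    while lo < hi
      mid = (lo + hi) / 2
      f.seek(mid)
      f.gets unless mid.zero?   # skip the (possibly partial) current line
      line = f.gets             # first complete line after offset mid
      if line && line.to_i < target
        lo = mid + 1
      else
        hi = mid
      end
    end
    f.seek(lo)
    f.gets unless lo.zero?      # realign to the next line boundary
    f.gets                      # the line we searched for (or nil at EOF)
  end
end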

I don't know if there is an equivalent way of doing this for lines, but you can use seek or the offset argument on an IO object to "skip" bytes.

See IO#seek, or see IO#open for information on the offset argument.
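For example, here's a minimal sketch assuming fixed-width records (the path and record length are placeholders); with variable-length lines you can't know a line's byte offset in advance:

RECORD_LEN = 80                       # hypothetical fixed line length, "\n" included
File.open("records.txt") do |f|       # placeholder path
  f.seek(500 * RECORD_LEN)            # jump straight past the first 500 records
  100.times do
    line = f.gets
    break if line.nil?
    # process lines 501-600 here
  end
end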

coreyward
  • To find out where a line ends (with an EOL character) there is no way out: you have to read the file byte by byte and then drop the read info. If you seek to the 1000th byte, you have no way of telling how many lines you have skipped. It could be 400, or 1, or even zero. – karatedog Aug 31 '20 at 21:37

Sounds like rio might be of help here. It provides you with a lines() method.
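For instance (a sketch based on rio's documented selection syntax; the path is a placeholder, and the 0-based range below should correspond to lines 501-600):

require 'rio'                               # gem install rio

wanted = rio('input.txt').lines[500..599]   # returns an array of those lines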

s.m.
  • That just iterates over the lines. That isn't very helpful in this situation. – coreyward Feb 19 '11 at 18:09
  • @coreyward: why not? You can pass it a range and iterate over those lines. Is there something I'm missing? – s.m. Feb 19 '11 at 18:43
  • The built-in IO library does the same thing. – coreyward Feb 19 '11 at 18:52
  • @coreyward: I still don't get it, sorry. The OP is asking for other ways of reading only certain lines of a file. Does my answer fail at that? You suggested something like `seek`, which is not going to work if you can't know how many bytes you would have to skip (e.g. you don't know how long each record is). – s.m. Feb 19 '11 at 19:11

You can use IO#readlines, which returns an array with all the lines:

IO.readlines(file_in)[500..599].each do |line|
  # line is each line in the range, including its trailing \n
  # (array indices are 0-based, so 500..599 covers lines 501-600)
  # stuff
end

or

f = File.new(file_in)
f.readlines[500..599].each do |line|
  # line is each line in the range, including its trailing \n
  # stuff
end
pablorc
  • This isn't very performance-friendly on large files. Building an array with 500,000 entries simply to access entries 230,000 to 230,100 isn't smart. If anything, iterating over each line in the stream and discarding them as needed is smarter, because the file doesn't get loaded into memory all at once. – coreyward Feb 19 '11 at 18:15
  • It could be my implementation (and, of course, I'll keep testing when I have time), but this method seems to be a bit slower, even on small files of about 2000 lines. That said, the difference is quite small at these levels: when I did f = File.new ... readlines[x..y] ... it took ~0.85 seconds on average; the initial method I posted gives me about 0.75 seconds on average (see the benchmark sketch below). Of course I might not be doing it properly, or very well. I'll do some more testing. – DRobinson Feb 19 '11 at 19:18
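A quick sketch of how such a comparison might be run with Ruby's standard Benchmark library (the filename, range, and iteration count are placeholders, not the OP's actual test):

require 'benchmark'

n = 100
Benchmark.bm(14) do |x|
  x.report("readlines:") do
    n.times { IO.readlines("test.txt")[500..599] }
  end
  x.report("line by line:") do
    n.times do
      File.open("test.txt") do |f|
        f.each_line.with_index(1) do |line, i|
          next if i <= 500
          break if i > 600
        end
      end
    end
  end
end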