How do I find the percent complete when parsing a file?

Question

How can I print what percentage of a file I have already parsed? I am parsing a text file, so I use:

file.each_line do

Is there a method like each_with_index that is available to use with strings?

This is how I currently use each_with_index to find percentage complete:

amount = 10000000
file.each_with_index do |line, index|
  if index == amount
    break
  end
  print "%.1f%% done" % (index / (amount * 1.0) * 100)
  print "\r"
end

If you are treating the file as a stream (`each_line`), how can it be known in advance how many lines there are? — matt, Apr 18 '13 at 17:19
possible duplicate of [Count the number of lines in a file with Ruby, without reading entire file into memory](http://stackoverflow.com/questions/2650517/count-the-number-of-lines-in-a-file-with-ruby-without-reading-entire-file-into) — the Tin Man, Apr 18 '13 at 18:55

the Tin Man · Answer 1 · 2013-04-18T19:03:40.983

To get the number of lines, you can do a couple different things.

If you are on Linux or Mac OS, take advantage of the underlying OS and ask it how many lines are in the file:

# wc prints "<count> <path>", so to_i keeps only the leading count
lines_in_file = `wc -l #{ path_to_file_to_read }`.to_i

wc is extremely fast, and can tell you about lines, words and characters. -l specifies lines.

If you want to do it in Ruby, you could use File.readlines('/path/to/file/to/read') or File.read('/path/to/file/to/read').lines, but be very careful. Both will read the entire file into memory, and, if that file is bigger than your available RAM you've just beaten your machine to a slow death. So, don't do that.

Instead use something like:

lines_in_file = 0
File.foreach('/path/to/file/to/read') { lines_in_file += 1 }

After running, lines_in_file will hold the number of lines in the file. File.foreach is VERY fast, pretty much equal to using File.readlines and probably faster than File.read().lines, and it only reads a line at a time so you're not filling your RAM.

If you want to know the current line number of the line you just read from a file, you can use Ruby's $. variable.
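As a rough sketch of how a line count and the current line number combine into a percentage: the helper below counts lines with File.foreach, then reads the file again and uses `lineno` on the open file, which is the per-file counterpart of `$.`. The helper name `each_line_with_percent` and the throwaway demo file are made up for illustration.

```ruby
require 'tempfile'

# Hypothetical helper: yields each line with the percentage of lines read so far.
def each_line_with_percent(path)
  total = 0
  File.foreach(path) { total += 1 }   # first pass: count lines without loading the file

  File.open(path) do |f|
    f.each_line do |line|
      # f.lineno is the number of lines read from f so far (1 after the first line)
      yield line, f.lineno * 100.0 / total
    end
  end
end

# Demo on a small temporary file
Tempfile.create('demo') do |tmp|
  tmp.write("a\nb\nc\nd\n")
  tmp.flush
  each_line_with_percent(tmp.path) do |line, pct|
    print "%.1f%% done\r" % pct
  end
  puts
end
```

Note that this reads the file twice, which may matter for very large files.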

You're concerned about "percentage of a file" though. A potential problem with this is lines are variable length. Depending on what you are doing with them, the line length could have a big effect on your progress meter. You might want to look at the actual length of the file and keep track of the number of characters consumed by reading each line, so your progress is based on percentage of characters, rather than percentage of lines.
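A minimal sketch of that size-based idea, assuming the file is read without newline or encoding conversion so that the bytes counted match File.size (which reports bytes, not characters). The helper name `each_line_with_byte_progress` is hypothetical.

```ruby
# Hypothetical helper: yields each line plus the percentage of bytes consumed so far.
def each_line_with_byte_progress(path)
  total_bytes = File.size(path)
  bytes_read = 0

  File.foreach(path) do |line|
    # bytesize counts bytes rather than characters, so multibyte text stays accurate
    bytes_read += line.bytesize
    yield line, bytes_read * 100.0 / total_bytes
  end
end

# Example call (path is a placeholder):
#   each_line_with_byte_progress('/path/to/file/to/read') do |line, pct|
#     print "%.1f%% done\r" % pct
#   end
```

This needs only one pass over the file, because the total is known from the file size up front.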

A File instance has a `lineno` method (more readable than `$.`) — steenslag, Apr 18 '13 at 21:15
@steenslag, good point. I keep forgetting `lineno`, probably because my years of writing in Perl corrupted my memory banks. — the Tin Man, Apr 18 '13 at 21:31
@theTinMan What's wrong with Perl!? :] Although I agree, it rears its head far too often when I write code in other languages. – squiguy, Apr 18 '13 at 21:37
What's wrong with it? Nothing. It's a great language and I wrote in it from v2 forward, but I grew tired of line noise. — the Tin Man, Apr 19 '13 at 00:35

score 3 · Accepted Answer · edited Apr 18 '13 at 21:34

Get all the lines upfront, then display the progress as you perform whatever operation you need on them.

lines = file.readlines
amount = lines.length

lines.each_with_index do |line, index|
  # process the line here; index + 1 lets the last line report 100%
  print "%.1f%% done" % ((index + 1) / (amount * 1.0) * 100)
  print "\r"
end

edited by steenslag · answered Apr 18 '13 at 17:34 by lightswitch05

Exactly. That is the answer implicit in my comment above. – matt Apr 18 '13 at 17:34
This works, but it is memory-inefficient because the whole file is kept in memory. I'd suggest estimating the line count as `File::size` divided by the average/median length of the first rows, optionally adjusting it if the deviation exceeds some bounds. – David Unric Apr 18 '13 at 18:49
This isn't a scalable solution. I get files that are much larger than the available RAM on a well-stocked host, and this would bring the host to its knees. – the Tin Man Apr 18 '13 at 18:52

score 1 · Answer 3 · answered Apr 18 '13 at 19:56

Without having to load the file beforehand, you could employ size and pos methods:

f = open('myfile')
while (line = f.gets)
  puts "#{(f.pos*100)/f.size}%\t#{line}"
end

Fewer lines, less logic, and accurate to the byte.
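A minimal sketch of the same idea using the block form of File.open, so the file is closed automatically, with the size read once before the loop. The helper name `each_line_with_pos_percent` is hypothetical.

```ruby
# Hypothetical helper: block-form variant that closes the file automatically.
def each_line_with_pos_percent(path)
  File.open(path) do |f|
    # Cached once; call f.size inside the loop instead to follow a growing file
    size = f.size
    while (line = f.gets)
      # Integer percentage based on the current byte offset in the file
      yield line, f.pos * 100 / size
    end
  end
end

# Example call (path is a placeholder):
#   each_line_with_pos_percent('myfile') { |line, pct| puts "#{pct}%\t#{line}" }
```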

answered Apr 18 '13 at 19:56 by maksimov

+1 But it's a good habit to close the file (`f.close`) when it is opened like this. Also for better performance store the filesize in a variable before the loop. – steenslag Apr 18 '13 at 21:05
Use the block form with `IO` and `File` methods to auto-close them. – the Tin Man Apr 18 '13 at 21:34
@steenslag I realised about the `close` as I was posting this, however I didn't deem it necessary for purposes of demonstration of the core principle (similarly to the dummy code inside the `while`). Thanks for the note however! – maksimov Apr 18 '13 at 23:07
On the storing of the file size in a variable - this was in my first version of the snippet. Then I thought to get rid of it for shortness' sake, with additional side-effect of being able to track a growing file ;-) – maksimov Apr 18 '13 at 23:12

toch · Answer 4 · 2013-04-18T19:21:20.843

0

Rather than reading the whole file and loading it into memory (as with read or readlines), I suggest using File.foreach, which reads the file as a stream, line by line.

count = 0
File.foreach('your_file') { count += 1 }
idx = 0
File.foreach('your_file') do |line|
  puts "#{(idx+1).to_f / count * 100}%"
  idx += 1
end

edited Apr 18 '13 at 19:21 · answered Apr 18 '13 at 17:26 by toch

Doesn't this read the file twice, which takes twice as long? This could be a problem. If you need to display "x% done" while reading a file, it is probably already time-consuming, even without reading it twice. – tessi Apr 18 '13 at 17:29
@PhilippTessenow totally agree, but you cannot do otherwise IMO, you need to read the files to count the lines. You cannot deduce it from its size (Except if you know what each line takes). That's why I'd use `wc` (cf my update) – toch Apr 18 '13 at 17:31
What about reading the file's content as an array of strings, counting the lines by checking the size of the array, and then parsing the strings of the array? – N.N. Apr 18 '13 at 17:33
Exactly so, @N.N. That is the whole point of my comment above: if you need to know how many lines there are, the file is no longer a stream; it's just a big string, so you might as well just do a `readlines` and have done with it. Now you've got an array and can proceed normally. – matt Apr 18 '13 at 17:34
@matt right, but for big files, it won't be possible as you need to put it in memory – toch Apr 18 '13 at 18:14
"but for big files, it won't be possible as you need to put it in memory", then `file.read` will fail as will `file.lines` because both need to pull it into memory. – the Tin Man Apr 18 '13 at 18:49
@theTinMan Thanks to you I've read about it. I had bad assumptions (due to the rewind, I thought it was a true stream read). I've updated my answer. I've learned something :). – toch Apr 18 '13 at 19:07
@toch But that is exactly my point. You cannot first have your entire cake and then propose to eat it only a bite at a time. If you are truly going to treat the file as a stream, because it is big, then to speak of a "percentage" is a contradiction in terms: the stream flows until it ends, and that's all you know. – matt Apr 18 '13 at 19:34
@matt OK, I understand your point. You're right, it is indeed nonsense. With the stream approach, the best we can do is an approximate progress figure based on an estimate of the average number of bytes per line, updated at each line. – toch Apr 18 '13 at 21:25
