63

I'm processing huge data files (millions of lines each).

Before I start processing I'd like to get a count of the number of lines in the file, so I can then indicate how far along the processing is.

Because of the size of the files, it would not be practical to read the entire file into memory, just to count how many lines there are. Does anyone have a good suggestion on how to do this?

the Tin Man
smnirven

15 Answers

79

Reading the file a line at a time:

count = File.foreach(filename).inject(0) {|c, line| c+1}

or the Perl-ish

File.foreach(filename) {}  # iterate to EOF without keeping lines
count = $.                 # $. holds the line number of the last line read

or

count = 0
File.open(filename) {|f| count = f.read.count("\n")}

Any of these will be slower than

count = %x{wc -l #{filename}}.split.first.to_i
glenn jackman
  • The last one is the cleanest, since we can assume "wc" is optimized for good I/O speeds. The ".split.first" are superfluous, and don't forget to add single quotes around the filename, or it will fail on filenames that have spaces. Simplified: %x{wc -l '#{filename}'}.to_i – deafgreatdane Aug 18 '11 at 18:17
  • or `count = %x{wc -l < "#{filename}"}.to_i` – glenn jackman Apr 24 '12 at 16:33
  • @deafgreatdane I don't think `wc` is "clean". Now it won't run on Windows... I would take a small performance hit (within the same complexity class) to avoid portability issues. – Mar 07 '13 at 17:42
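Following up on that portability concern, one option is to fall back to pure Ruby when wc isn't available. A minimal sketch; Gem.win_platform? is one common way to detect Windows, and count_lines is a hypothetical helper name:

# Use the fast external wc on Unix-like systems; fall back to a
# pure-Ruby, line-by-line count elsewhere (e.g. Windows without wc).
def count_lines(filename)
  if Gem.win_platform?
    File.foreach(filename).count   # streams the file, constant memory
  else
    `wc -l "#{filename}"`.to_i     # delegates to the optimized wc
  end
end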
75

If you are in a Unix environment, you can just let wc -l do the work.

It will not load the whole file into memory; wc is optimized for streaming a file and counting words/lines, so its performance will be better than streaming the file yourself in Ruby.

SSCCE:

filename = 'a_file/somewhere.txt'
line_count = `wc -l "#{filename}"`.strip.split(' ')[0].to_i
p line_count

Or, if you want to count a collection of files passed on the command line:

wc_output = `wc -l "#{ARGV.join('" "')}"`
line_count = wc_output.match(/^ *([0-9]+) +total$/).captures[0].to_i
p line_count
Jason
DJ.
  • wc is so fast that one probably won't need a progress counter. – Wayne Conrad Apr 16 '10 at 05:36
  • There's an edge condition: if the last line of the file doesn't have a newline, wc comes up one short. This is by POSIX design; see http://backreference.org/2010/05/23/sanitizing-files-with-no-trailing-newline/ – deafgreatdane Oct 19 '11 at 17:11
  • Please do more than just cite the method; show an example of wc -l in practice. Not everybody knows the things which are obvious to you. (I know, "Google harder!"... but if we could all be more like Ruby, we would do this instinctively.) – boulder_ruby Jul 20 '12 at 07:02
  • A solution for the edge condition (if the last line doesn't have a newline): `count = %x{sed -n '=' #{file} | wc -l}.to_i` Reference: http://stackoverflow.com/questions/12616039/wc-command-of-mac-showing-one-less-result – awaage Jan 28 '14 at 07:40
  • line_count = `wc -l "#{filename}" | cut -d" " -f1`.to_i – isqad Sep 16 '15 at 04:48
  • It is important to note that `filename` should be a trusted input when calling `wc -l "#{filename}"`. Otherwise I suggest using `Shellwords.escape`. The issue being for instance `filename = "file; echo pwnd"`. – Ulysse BN Jun 08 '20 at 13:41
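To make that last point concrete, a minimal sketch using Shellwords from Ruby's standard library (the hostile filename is the one from the comment above):

require 'shellwords'

filename = "file; echo pwnd"  # untrusted input; naive interpolation would run `echo pwnd`
# Shellwords.escape neutralizes shell metacharacters, so the whole
# string is passed to wc as a single filename argument.
count = `wc -l #{Shellwords.escape(filename)}`.to_i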
15

It doesn't matter what language you're using, you're going to have to read the whole file if the lines are of variable length. That's because the newlines could be anywhere, and there's no way to know without reading the file (assuming it isn't cached, which generally speaking it isn't).

If you want to indicate progress, you have two realistic options. You can extrapolate progress based on assumed line length:

assumed lines in file = size of file / assumed line size
progress = lines processed / assumed lines in file * 100%

since you know the size of the file. Alternatively you can measure progress as:

progress = bytes processed / size of file * 100%

This should be sufficient.
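A minimal sketch of the byte-based option; the filename and the reporting interval are placeholders:

# Track progress by bytes consumed: IO#pos reports how far into
# the file we have read, and File.size gives the total size.
filename = "huge_data_file.csv"
total = File.size(filename).to_f

File.open(filename) do |f|
  f.each_line.with_index(1) do |line, i|
    # ... process line here ...
    printf("\rprogress: %.1f%%", 100 * f.pos / total) if (i % 100_000).zero?
  end
end
puts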

cletus
12

Using Ruby:

file = File.open("path-to-file", "r")
file.readlines.size

This was 39 milliseconds faster than wc -l on a file with 325,477 lines.

JBoy
  • Your system's `wc` has issues. – Clint Pachl May 05 '14 at 19:28
  • I am quite sure you cannot do this with a huge file because its content will be entirely held in memory. – Pikachu Sep 29 '14 at 00:43
  • Using `readlines` or `read` is not scalable. It's also slower than reading line-by-line using `foreach` once the files get beyond 1MB in size. See http://stackoverflow.com/q/25189262/128421 – the Tin Man Jul 04 '16 at 20:39
12

Summary of the posted solutions

require 'benchmark'
require 'csv'

filename = "name.csv"

Benchmark.bm do |x|
  x.report { `wc -l < "#{filename}"`.to_i }
  x.report { File.open(filename).inject(0) { |c, line| c + 1 } }
  x.report { File.foreach(filename).inject(0) {|c, line| c+1} }
  x.report { File.read(filename).scan(/\n/).count }
  x.report { CSV.open(filename, "r").readlines.count }
end

File with 807802 lines:

       user     system      total        real
   0.000000   0.000000   0.010000 (  0.030606)
   0.370000   0.050000   0.420000 (  0.412472)
   0.360000   0.010000   0.370000 (  0.374642)
   0.290000   0.020000   0.310000 (  0.315488)
   3.190000   0.060000   3.250000 (  3.245171)
Exsemt
  • [I've rewritten a bit your benchmark](https://stackoverflow.com/a/60434970/6320039), if you'd rather like I to edit your answer, just ping me here :) – Ulysse BN Feb 27 '20 at 14:07
7

DISCLAIMER: the existing benchmark used count rather than length or size (count is notoriously slower in Ruby), and it was tedious to read IMHO. Hence this new answer.

Benchmark

require "benchmark"
require "benchmark/ips"
require "csv"

filename = ENV.fetch("FILENAME")

Benchmark.ips do |x|
  x.report("wc") { `wc -l #{filename}`.to_i }
  x.report("open") { File.open(filename).inject(0, :next) }
  x.report("foreach") { File.foreach(filename).inject(0, :next) }
  x.report("foreach $.") { File.foreach(filename) {}; $. }
  x.report("read.scan.length") { File.read(filename).scan(/\n/).length }
  x.report("CSV.open.readlines") { CSV.open(filename, "r").readlines.length }
  x.report("IO.readlines.length") { IO.readlines(filename).length }

  x.compare!
end

On my MacBook Pro (2017) with a 2.3 GHz Intel Core i5 processor:

Warming up --------------------------------------
                  wc     8.000  i/100ms
                open     2.000  i/100ms
             foreach     2.000  i/100ms
          foreach $.     2.000  i/100ms
    read.scan.length     2.000  i/100ms
  CSV.open.readlines     1.000  i/100ms
 IO.readlines.length     2.000  i/100ms
Calculating -------------------------------------
                  wc    115.014  (±21.7%) i/s -    552.000  in   5.020531s
                open     22.450  (±26.7%) i/s -    104.000  in   5.049692s
             foreach     32.669  (±27.5%) i/s -    150.000  in   5.046793s
          foreach $.     25.244  (±31.7%) i/s -    112.000  in   5.020499s
    read.scan.length     44.102  (±31.7%) i/s -    190.000  in   5.033218s
  CSV.open.readlines      2.395  (±41.8%) i/s -     12.000  in   5.262561s
 IO.readlines.length     36.567  (±27.3%) i/s -    162.000  in   5.089395s

Comparison:
                  wc:      115.0 i/s
    read.scan.length:       44.1 i/s - 2.61x  slower
 IO.readlines.length:       36.6 i/s - 3.15x  slower
             foreach:       32.7 i/s - 3.52x  slower
          foreach $.:       25.2 i/s - 4.56x  slower
                open:       22.4 i/s - 5.12x  slower
  CSV.open.readlines:        2.4 i/s - 48.02x  slower

The benchmark was run against a file containing 75,516 lines and 3,532,510 characters (~47 chars per line). You should try it with your own file, dimensions, and computer for precise results.

Ulysse BN
3

Same as DJ's answer, but giving the actual Ruby code:

count = %x{wc -l "#{file_path}"}.split[0].to_i

The first part,

wc -l file_path

gives you output of the form

num_lines file_path

The split and to_i then extract the number.
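Step by step, with illustrative values (the filename and count are made up):

file_path = "data.csv"              # illustrative
output = %x{wc -l "#{file_path}"}   # => "  807802 data.csv\n"
output.split                        # => ["807802", "data.csv"]
output.split[0].to_i                # => 807802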

Yu Hao
justingordon
2

For reasons I don't fully understand, scanning the file for newlines using File seems to be a lot faster than doing CSV#readlines.count.

The following benchmark used a CSV file with 1,045,574 lines of data and 4 columns:

       user     system      total        real
   0.639000   0.047000   0.686000 (  0.682000)
  17.067000   0.171000  17.238000 ( 17.221173)

The code for the benchmark is below:

require 'benchmark'
require 'csv'

file = "1-25-2013 DATA.csv"

Benchmark.bm do |x|
    x.report { File.read(file).scan(/\n/).count }
    x.report { CSV.open(file, "r").readlines.count }
end

As you can see, scanning the file for newlines is an order of magnitude faster.

fbonetti
  • The scan forces the whole file into memory, and the file object is leaked. – Mar 07 '13 at 17:44
  • For a 10M-line file I see IOError: File too large. – so_mv Mar 15 '13 at 00:21
  • `File.read` loads it and does nothing to it. `CSV.open(...).readlines` reads and parses every line into arrays. You're comparing apples to oranges, so of course there is going to be a big difference in speed. – the Tin Man Apr 18 '13 at 18:58
  • Not to mention a CSV entry could have newlines in a field without actually being a new CSV row itself; `File.read` will not recognize this. – nzifnab Dec 05 '14 at 22:39
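To make nzifnab's point concrete, a small sketch with made-up data: a quoted CSV field may contain an embedded newline, so physical lines and parsed rows can differ.

require 'csv'

data = "a,\"one\ntwo\"\nb,c\n"   # one field contains an embedded newline
data.count("\n")                 # => 3 physical newlines
CSV.parse(data).length           # => 2 actual CSV rows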
2

I have this one-liner:

puts File.foreach('myfile.txt').count
vikas027
2

The test results for a file with more than 135k lines are shown below. This is my benchmark code:

require 'benchmark'

file_name = '100m.csv'
Benchmark.bm do |x|
  x.report { File.new(file_name).readlines.size }
  x.report { `wc -l "#{file_name}"`.strip.split(' ')[0].to_i }
  x.report { File.read(file_name).scan(/\n/).count }
end

The result is:

   user     system      total        real
 0.100000   0.040000   0.140000 (  0.143636)
 0.000000   0.000000   0.090000 (  0.093293)
 0.380000   0.060000   0.440000 (  0.464925)

The wc -l approach has one problem: if there is only one line in the file and it does not end with \n, then the count is zero.

So, I recommend calling wc only when you are counting more than one line.
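More generally, wc -l counts newline characters, so any file whose last line lacks a trailing newline comes up one short (see the POSIX note in the comments above). A hedged sketch of a compensating helper; line_count is a hypothetical name:

# Count newlines with wc, then add one if the file is non-empty
# and its last byte is not "\n".
def line_count(filename)
  count = `wc -l "#{filename}"`.to_i
  File.open(filename) do |f|
    if f.size > 0
      f.seek(-1, IO::SEEK_END)          # jump to the last byte
      count += 1 unless f.getc == "\n"  # last line had no newline
    end
  end
  count
end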

SeongSu
1

If the file is a CSV file, the lengths of the records should be fairly uniform if the content of the file is numeric. Wouldn't it make sense to just divide the size of the file by the length of a record, or by the mean length of the first 100 records?
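A minimal sketch of that estimate, sampling the first 100 lines; the filename and sample size are illustrative, and a non-empty file is assumed:

# Estimate total lines: file size divided by the mean byte length
# of a small sample read from the start of the file.
filename = "huge_data_file.csv"
sample = File.foreach(filename).first(100)   # reads only ~100 lines
mean = sample.sum(&:bytesize) / sample.size.to_f
estimate = (File.size(filename) / mean).round
puts "roughly #{estimate} lines"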

ptcesq
0

With UNIX-style text files, it's very simple:

f = File.new("/path/to/whatever")
num_newlines = 0
while (c = f.getc) != nil
  num_newlines += 1 if c == "\n"
end

That's it. For MS Windows text files, you'll have to check for a sequence of "\r\n" instead of just "\n", but that's not much more difficult. For Mac OS Classic text files (as opposed to Mac OS X), you would check for "\r" instead of "\n".
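A sketch of the same getc loop generalized to count all three ending styles in one pass, treating "\r\n", bare "\n", and bare "\r" each as one line ending:

# Count "\n", "\r\n", and bare "\r" line endings in a single pass.
num_newlines = 0
prev = nil
File.open("/path/to/whatever") do |f|
  while (c = f.getc)
    # "\n" always ends a line; a "\r" not followed by "\n" does too
    num_newlines += 1 if c == "\n" || (prev == "\r" && c != "\n")
    prev = c
  end
  num_newlines += 1 if prev == "\r"  # file ends with a bare "\r"
end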

So, yeah, this looks like C. So what? C's awesome and Ruby is awesome because when a C answer is easiest that's what you can expect your Ruby code to look like. Hopefully your dain hasn't already been bramaged by Java.

By the way, please don't even consider any of the answers above that use IO#read or IO#readlines and then call a String method on what's been read. You said you didn't want to read the whole file into memory, and that's exactly what these do. This is why Donald Knuth recommends that people understand how to program closer to the hardware: if they don't, they'll end up writing "weird code". Obviously you don't want to code close to the hardware whenever you don't have to, but that should be common sense. However, you should learn to recognize the instances in which you do have to get closer to the nuts and bolts, such as this one.

And don't try to get more "object oriented" than the situation calls for. That's an embarrassing trap for newbies who want to look more sophisticated than they really are. You should always be glad for the times when the answer really is simple, and not be disappointed when there's no complexity to give you the opportunity to write "impressive" code. However, if you want to look somewhat "object oriented" and don't mind reading an entire line into memory at a time (i.e., you know the lines are short enough), you can do this:

f = File.new("/path/to/whatever")
num_newlines = 0
f.each_line do
  num_newlines += 1
end

This would be a good compromise, but only if the lines aren't too long; in that case it might even run more quickly than my first solution.

the Tin Man
  • While this has some good information, such as "don't try to get more "object oriented" than the situation calls for", the code isn't best-practice: You're failing to close your file handle. Either do it explicitly, or use a block with `File.open` and let Ruby do it for you. However, using `File.foreach` with a counter inside would be even simpler and just as fast. – the Tin Man Dec 18 '13 at 17:07
0

Using foreach without inject is about 3% faster than with inject. Both are very much faster (more than 100x in my experience) than using getc.

Using foreach without inject can also be slightly simplified (relative to the snippet given elsewhere in this thread) as follows:

count = 0; File.foreach(path) { count += 1 }
puts "count: #{count}"
peak
0

wc -l in Ruby with less memory, the lazy way:

(ARGV.length == 0 ?
 [["", STDIN]] :
    ARGV.lazy.map { |file_name|
        [file_name, File.open(file_name)]
})
.map { |file_name, file|
    "%8d %s\n" % [*file
                    .each_line
                    .lazy
                    .map { |line| 1 }
                    .reduce(:+), file_name]
}
.each(&:display)

as originally shown by Shugo Maeda.

Example:

$ curl -s -o wc.rb -L https://git.io/vVrQi
$ chmod u+x wc.rb
$ ./wc.rb huge_data_file.csv
  43217291 huge_data_file.csv
altamic
-2

You can read through to the last line and then check its line number:

f = File.new('huge-file')
f.readlines[-1]    # note: this still reads the entire file into memory
count = f.lineno
D.S.