80

I commonly work with text files of ~20 GB in size, and I find myself counting the number of lines in a given file very often.

The way I do it now is just cat fname | wc -l, and it takes very long. Is there any solution that'd be much faster?

I work on a high-performance cluster with Hadoop installed. I was wondering if a MapReduce approach could help.

I'd like the solution to be as simple as a one-line run, like the wc -l solution, but I'm not sure how feasible that is.

Any ideas?

Du-Lacoste
  • 11,530
  • 2
  • 71
  • 51
Dnaiel
  • 7,622
  • 23
  • 67
  • 126
  • Do each of the nodes already have a copy of the file? – Ignacio Vazquez-Abrams Oct 03 '12 at 20:45
  • Thanks, yes. But to access many nodes I use an LSF system which sometimes exhibits quite annoying waiting times; that's why the ideal solution would be to use hadoop/mapreduce on one node, but it'd be possible to use other nodes (though adding the waiting time may make it slower than just the cat wc approach) – Dnaiel Oct 03 '12 at 20:47
  • 4
    `wc -l fname` may be faster. You can also try `vim -R fname` if that is faster (it should tell you the number of lines after startup). – ott-- Oct 03 '12 at 20:50
  • 1
    you can do it with a pig script see my reply here: http://stackoverflow.com/questions/9900761/pig-how-to-count-a-number-of-rows-in-alias – Arnon Rotem-Gal-Oz Oct 04 '12 at 04:35
  • 1
    Somewhat faster is to remember the [useless use of cat](https://en.wikipedia.org/wiki/Cat_%28Unix%29#Useless_use_of_cat) rule. – arielf Oct 16 '15 at 22:44
  • Fastest way is `gawk 'END {print NR}' file_name` – EsmaeelE Dec 11 '22 at 07:55

15 Answers

116

Try: sed -n '$=' filename

Also, cat is unnecessary: wc -l filename is enough for what you are doing now.

P.P
  • 117,907
  • 20
  • 175
  • 238
  • mmm interesting. would a map/reduce approach help? I assume that if I saved all the files in HDFS format and then tried to count the lines using map/reduce, it would be much faster, no? – Dnaiel Oct 03 '12 at 20:50
  • @lvella. It depends how they are implemented. In my experience I have seen `sed` is faster. Perhaps, a little benchmarking can help understand it better. – P.P Oct 03 '12 at 20:52
  • @KingsIndian. Indeed, just tried sed and it was 3-fold faster than wc on a 3 GB file. Thanks KingsIndian. – Dnaiel Oct 03 '12 at 21:06
  • @Dnaiel Thanks for the info. I thought of doing it myself :) – P.P Oct 03 '12 at 21:07
  • 35
    @Dnaiel If I would guess I'd say you ran `wc -l filename` first, then you ran `sed -n '$=' filename`, so that in the first run wc had to read all the file from the disk, so it could be cached entirely on your probably bigger than 3Gb memory, so `sed` could run much more quickly right next. I did the tests myself with a 4Gb file on a machine with 6Gb RAM, but I made sure the file was already on the cache; the score: `sed` - 0m12.539s, `wc -l` - 0m1.911s. So `wc` was 6.56 times faster. Redoing the experiment but clearing the cache before each run, they both took about 58 seconds to complete. – lvella Oct 03 '12 at 21:50
  • 2
    This solution using sed has the added advantage of not requiring an end of line character. wc counts end of line characters ("\n"), so if you have, say, one line in the file without a \n, then wc will return 0. sed will correctly return 1. – Sevak Avakians Dec 12 '17 at 21:01
16

Your limiting factor is the I/O speed of your storage device, so switching between simple newline/pattern-counting programs won't help, because the execution-speed difference between those programs is likely to be dwarfed by the much slower disk/storage/whatever you have.

But if you have the same file copied across disks/devices, or the file is distributed among those disks, you can certainly perform the operation in parallel. I don't know specifically about Hadoop, but assuming you can read a 10 GB file from 4 different locations, you can run 4 different line-counting processes, each on one part of the file, and sum their results up:

$ dd bs=4k count=655360 if=/path/to/copy/on/disk/1/file | wc -l &
$ dd bs=4k skip=655360 count=655360 if=/path/to/copy/on/disk/2/file | wc -l &
$ dd bs=4k skip=1310720 count=655360 if=/path/to/copy/on/disk/3/file | wc -l &
$ dd bs=4k skip=1966080 if=/path/to/copy/on/disk/4/file | wc -l &

Notice the & at the end of each command line, so all will run in parallel; dd works like cat here, but allows us to specify how many bytes to read (count * bs bytes) and how many to skip at the beginning of the input (skip * bs bytes). It works in blocks, hence the need to specify bs as the block size. In this example, I've partitioned the 10 GB file into 4 equal chunks of 4 KB * 655360 = 2684354560 bytes = 2.5 GB, one given to each job; you may want to set up a script to do it for you based on the size of the file and the number of parallel jobs you will run. You also need to sum the results of the executions, which I haven't done due to my lack of shell-scripting ability.
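
A minimal sketch of such a script (splitting based on the file size, launching the jobs, and summing the partial counts), assuming bash, GNU coreutils (stat, dd) and a single local copy of the file; splitting at arbitrary byte boundaries is fine here because wc -l only counts newline characters, so each newline is counted exactly once:

#!/bin/bash
# Sketch: count the lines of "$1" using "$2" parallel dd | wc -l jobs (default 4).
file="$1"
jobs="${2:-4}"
bs=4096
blocks=$(( ( $(stat -c%s "$file") + bs - 1 ) / bs ))   # total 4 KB blocks (GNU stat)
per_job=$(( (blocks + jobs - 1) / jobs ))              # blocks handled by each job

for ((i = 0; i < jobs; i++)); do
    dd bs=$bs skip=$((i * per_job)) count=$per_job if="$file" 2>/dev/null | wc -l &
done | paste -sd+ - | bc   # the pipe stays open until every background wc finishes

Called as, for example, ./parallel_count.sh bigfile 4 (the script name is arbitrary), it prints the total line count.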

If your filesystem is smart enough to split a big file among many devices, like a RAID or a distributed filesystem or something, and to automatically parallelize I/O requests that can be parallelized, you can do such a split, running many parallel jobs but using the same file path, and you may still get some speed gain.

EDIT: Another idea that occurred to me: if the lines inside the file all have the same size, you can get the exact number of lines by dividing the size of the file by the size of the line, both in bytes. You can do it almost instantaneously in a single job. If you only have the mean line size and don't care about the exact line count but want an estimate, you can do this same operation and get a satisfactory result much faster than the exact operation.
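
A sketch of that division as a shell one-liner, assuming GNU stat and using the first line as the sample (file.txt is a placeholder; the trailing newline is included in the per-line byte count):

echo $(( $(stat -c%s file.txt) / $(head -n1 file.txt | wc -c) ))   # file size / bytes per line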

lvella
  • 12,754
  • 11
  • 54
  • 106
11

As per my test, I can verify that the Spark shell (based on Scala) is way faster than the other tools (grep, sed, awk, perl, wc). Here is the result of a test that I ran on a file which had 23782409 lines:

time grep -c $ my_file.txt;

real    0m44.96s
user    0m41.59s
sys     0m3.09s

time wc -l my_file.txt;

real    0m37.57s
user    0m33.48s
sys     0m3.97s

time sed -n '$=' my_file.txt;

real    0m38.22s
user    0m28.05s
sys     0m10.14s

time perl -ne 'END { $_=$.;if(!/^[0-9]+$/){$_=0;};print "$_" }' my_file.txt;

real    0m23.38s
user    0m20.19s
sys     0m3.11s

time awk 'END { print NR }' my_file.txt;

real    0m19.90s
user    0m16.76s
sys     0m3.12s

spark-shell
import org.joda.time._
val t_start = DateTime.now()
sc.textFile("file://my_file.txt").count()
val t_end = DateTime.now()
new Period(t_start, t_end).toStandardSeconds()

res1: org.joda.time.Seconds = PT15S

Pramod Tiwari
  • 149
  • 1
  • 8
  • 1
    You can just prefix your command with `time` to get the runtime. – Javad Oct 19 '16 at 23:03
  • just realized that I had AIX based system on which I was performing these tests and it does not support the time keyword the way i was expecting it to work out – Pramod Tiwari Nov 21 '16 at 14:47
  • FWIW, I don't think you can count on these times being consistent across all OSes. "wc -l" was faster than awk for me counting lines on a 1.1 GB log file. sed was slow though. Thanks for showing the options though! – Peter Turner Aug 28 '18 at 15:49
  • I completely agree with you. It would certainly depend a lot on these utility's optimization on different OSes. I am not sure how these small utilities are designed in different flavors. Thanks for bringing in that perspective. – Pramod Tiwari Sep 30 '18 at 13:00
  • @PramodTiwari What is the meaning of `PT15S` ? – SebMa Aug 19 '22 at 08:34
8

On a multi-core server, use GNU parallel to count file lines in parallel. After each file's line count is printed, bc sums all the line counts.

find . -name '*.txt' | parallel 'wc -l {}' 2>/dev/null | paste -sd+ - | bc

To save space, you can even keep all files compressed. The following line uncompresses each file and counts its lines in parallel, then sums all counts.

find . -name '*.xz' | parallel 'xzcat {} | wc -l' 2>/dev/null | paste -sd+ - | bc
Nicholas Sushkin
  • 13,050
  • 3
  • 30
  • 20
  • Good idea. I'm using this. See my answer about using `dd` instead of `wc` to read the file if disk bottleneck is an issue. – sudo May 06 '17 at 01:42
6

If your data resides on HDFS, perhaps the fastest approach is to use Hadoop streaming. Apache Pig's COUNT UDF operates on a bag, and therefore uses a single reducer to compute the number of rows. Instead you can manually set the number of reducers in a simple Hadoop streaming script as follows:

$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar -Dmapred.reduce.tasks=100 -input <input_path> -output <output_path> -mapper /bin/cat -reducer "wc -l"

Note that I manually set the number of reducers to 100, but you can tune this parameter. Once the map-reduce job is done, the result from each reducer is stored in a separate file. The final count of rows is the sum of the numbers returned by all reducers. You can get the final count of rows as follows:

$HADOOP_HOME/bin/hadoop fs -cat <output_path>/* | paste -sd+ | bc
Pirooz
  • 1,268
  • 1
  • 13
  • 24
6

I know the question is a few years old now, but expanding on lvella's last idea, this bash script estimates the line count of a big file in seconds or less by measuring the size of one line and extrapolating from it:

#!/bin/bash
head -2 $1 | tail -1 > $1_oneline
filesize=$(du -b $1 | cut -f -1)
linesize=$(du -b $1_oneline | cut -f -1)
rm $1_oneline
echo $(expr $filesize / $linesize)

If you name this script lines.sh, you can call lines.sh bigfile.txt to get the estimated number of lines. In my case (about 6 GB, an export from a database), the deviation from the true line count was only 3%, but it ran about 1000 times faster. By the way, I used the second line, not the first, as the basis, because the first line had column names and the actual data started in the second line.

Nico
  • 151
  • 2
  • 6
  • For all the above answers, I tried (i) cat filename | wc -l # gave me the wrong answer, and (ii) sed -n '$=' filename # gave me the wrong result. Then I tried this script and it gave me the correct result, around 1 million lines. Thanks +1 – Sanket Thakkar Jul 13 '17 at 10:32
  • 1
    You could actually use tail instead of head in the first line. And why just 1 line: take 1000 and multiply it back at the end; if the lines are more or less random, that gives a more precise result than using a single-line calculation. The problem is if the recordset is poorly distributed; then this number is worth nothing :( – Алексей Лещук Jun 18 '19 at 17:28
4

If your bottleneck is the disk, it matters how you read from it. dd if=filename bs=128M | wc -l is a lot faster than wc -l filename or cat filename | wc -l on my machine, which has an HDD and a fast CPU and RAM. You can play around with the block size and see what dd reports as the throughput. I cranked it up to 1GiB.
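
For instance, a quick sketch for comparing block sizes (filename is a placeholder; dd reports bytes copied and throughput on stderr, while wc prints the line count):

# On Linux you can drop the page cache between runs (as root) for a fair comparison:
#   sync; echo 3 > /proc/sys/vm/drop_caches
for bs in 4M 32M 128M 1G; do
    echo "== block size $bs =="
    dd if=filename bs=$bs | wc -l
done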

Note: There is some debate about whether cat or dd is faster. All I claim is that dd can be faster, depending on the system, and that it is for me. Try it for yourself.

sudo
  • 5,604
  • 5
  • 40
  • 78
3

Hadoop essentially provides a mechanism to perform something similar to what @lvella is suggesting.

Hadoop's HDFS (distributed file system) is going to take your 20 GB file and save it across the cluster in blocks of a fixed size. Let's say you configure the block size to be 128 MB; the file would then be split into 20 × 8 = 160 blocks of 128 MB each.

You would then run a map reduce program over this data, essentially counting the lines for each block (in the map stage) and then reducing these block line counts into a final line count for the entire file.

As for performance, in general the bigger your cluster, the better the performance (more wc's running in parallel, over more independent disks), but there is some overhead in job orchestration, which means that running the job on smaller files will not actually yield quicker throughput than running a local wc.
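
A rough sketch of the map half of that idea using Hadoop streaming (the input/output paths and the streaming-jar location are placeholders, and for simplicity the final summing is done on the client rather than in a reducer):

# Each mapper runs 'wc -l' over its own input split and emits a single count;
# with zero reduce tasks the per-split counts land directly in the output files.
$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
    -Dmapred.reduce.tasks=0 \
    -input /user/me/bigfile.txt \
    -output /user/me/bigfile_linecounts \
    -mapper 'wc -l'

# Sum the per-split counts to get the total:
$HADOOP_HOME/bin/hadoop fs -cat /user/me/bigfile_linecounts/part-* | paste -sd+ - | bc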

Chris White
  • 29,949
  • 4
  • 71
  • 93
3

I'm not sure that python is quicker:

[root@myserver scripts]# time python -c "print len(open('mybigfile.txt').read().split('\n'))"

644306


real    0m0.310s
user    0m0.176s
sys     0m0.132s

[root@myserver scripts]# time  cat mybigfile.txt  | wc -l

644305


real    0m0.048s
user    0m0.017s
sys     0m0.074s
A.L
  • 10,259
  • 10
  • 67
  • 98
eugene
  • 31
  • 1
  • 1
    you are actually showing that python is slower here. – Arnaud Potier May 05 '15 at 08:26
  • 1
    Python could do the job, but **certainly** not with `...read().split("\n")`. Change that for `sum(1 for line in open("mybigfile.txt"))` and you have a better naive approach (i.e. not taking any advantage of the HDFS setup) – jsbueno May 06 '15 at 21:28
  • @Arnaud Potier, I suspect this post is in response to another solution which recommended python. – MinneapolisCoder9 Sep 08 '22 at 19:00
1

If your computer has python, you can try this from the shell:

python -c "print len(open('test.txt').read().split('\n'))"

This uses python -c to pass in a command that reads the whole file and splits it on the newline character, then takes the length of the resulting list as the line count.

@BlueMoon's:

bash-3.2$ sed -n '$=' test.txt
519

Using the above:

bash-3.2$ python -c "print len(open('test.txt').read().split('\n'))"
519
ZenOfPython
  • 891
  • 6
  • 15
  • 8
    Having python parse for every \n in a 20GB file seems like a pretty terribly slow way to try to do this. – mikeschuld Dec 17 '14 at 22:14
  • 1
    Terrible solution compared to using sed. – PureW Apr 14 '15 at 07:57
  • 2
    The problem is not python parsing the "\n" - both sed and wc will have to do that as well. What is terrible about this is _reading everything into memory, and then asking Python to split the block of data at each "\n"_ (not only duplicating all the data in memory, but also performing a relatively expensive object creation for each line) – jsbueno May 06 '15 at 21:30
  • ``python -c "print(sum(1 for line in open('text.txt')))"`` would be a better solution in _python_ because it doesn't read the entire file into memory, but either sed or wc would be a much better solution. – zombieguru May 12 '16 at 14:44
1
find  -type f -name  "filepattern_2015_07_*.txt" -exec ls -1 {} \; | cat | awk '//{ print $0 , system("cat " $0 "|" "wc -l")}'

Output:

bigbounty
  • 16,526
  • 5
  • 37
  • 65
1

I have a 645GB text file, and none of the earlier exact solutions (e.g. wc -l) returned an answer within 5 minutes.

Instead, here is a Python script that computes the approximate number of lines in a huge file. (My text file apparently has about 5.5 billion lines.) The Python script does the following:

A. Counts the number of bytes in the file.

B. Reads the first N lines in the file (as a sample) and computes the average line length.

C. Computes A/B as the approximate number of lines.

It follows along the lines of Nico's answer, but instead of taking the length of one line, it computes the average length of the first N lines.

Note: I'm assuming an ASCII text file, so I expect the Python len() function to return the number of chars as the number of bytes.

Put this code into a file line_length.py:

#!/usr/bin/env python

# Usage:
# python line_length.py <filename> <N> 

import os
import sys
import numpy as np

if __name__ == '__main__':

    file_name = sys.argv[1]
    N = int(sys.argv[2]) # Number of first lines to use as sample.
    file_length_in_bytes = os.path.getsize(file_name)
    lengths = [] # Accumulate line lengths.
    num_lines = 0

    with open(file_name) as f:
        for line in f:
            num_lines += 1
            if num_lines > N:
                break
            lengths.append(len(line))

    arr = np.array(lengths)
    lines_count = len(arr)
    line_length_mean = np.mean(arr)
    line_length_std = np.std(arr)

    line_count_mean = file_length_in_bytes / line_length_mean

    print('File has %d bytes.' % (file_length_in_bytes))
    print('%.2f mean bytes per line (%.2f std)' % (line_length_mean, line_length_std))
    print('Approximately %d lines' % (line_count_mean))

Invoke it like this with N=5000.

% python line_length.py big_file.txt 5000

File has 645620992933 bytes.
116.34 mean bytes per line (42.11 std)
Approximately 5549547119 lines

So there are about 5.5 billion lines in the file.

stackoverflowuser2010
  • 38,621
  • 48
  • 169
  • 217
0

Let us assume:

  • Your file system is distributed
  • Your file system can easily fill the network connection to a single node
  • You access your files like normal files

then you really want to chop the files into parts, count parts in parallel on multiple nodes and sum up the results from there (this is basically @Chris White's idea).

Here is how you do that with GNU Parallel (version > 20161222). You need to list the nodes in ~/.parallel/my_cluster_hosts and you must have ssh access to all of them:

parwc() {
    # Usage:
    #   parwc -l file                                                                

    # Give one chunk per host
    chunks=$(cat ~/.parallel/my_cluster_hosts|wc -l)
    # Build commands that take a chunk each and do 'wc' on that                    
    # ("map")                                                                      
    parallel -j $chunks --block -1 --pipepart -a "$2" -vv --dryrun wc "$1" |
        # For each command                                                         
        #   log into a cluster host                                                
        #   cd to current working dir                                              
        #   execute the command                                                    
        parallel -j0 --slf my_cluster_hosts --wd . |
        # Sum up the number of lines                                               
        # ("reduce")                                                               
        perl -ne '$sum += $_; END { print $sum,"\n" }'
}

Use as:

parwc -l myfile
parwc -w myfile
parwc -c myfile
Ole Tange
  • 31,768
  • 5
  • 86
  • 104
0

With slower I/O, falling back to dd if={file} bs=128M | wc -l helps tremendously while gathering the data for wc to churn through.

I've also stumbled upon

https://github.com/crioux/turbo-linecount

which is great.

Henry Tseng
  • 3,263
  • 1
  • 19
  • 20
0

You could use the following, which is pretty fast:

wc -l filename # assume the file has 50 lines, then the output is -> 50 filename

In addition, if you just want to get the number without displaying the file name, you can use this trick. It will only give you the number of lines in the file, without its name.

wc -l filename | cut -f1 -d ' ' # space is the delimiter, hence the output -> 50
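
Alternatively, redirecting the file into wc's standard input gives the bare number as well, since wc has no file name to report when reading from stdin:

wc -l < filename   # reads from stdin, so only the count is printed -> 50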
Du-Lacoste
  • 11,530
  • 2
  • 71
  • 51