How do I get the line count of a large file (at least 5 GB)? What is the fastest approach using the shell?
2 Answers
Step 1: `head -n $n filename > newfile` — copy the first n lines into newfile, e.g. n = 5
Step 2: Get the size of the huge file, A
Step 3: Get the size of newfile, B
Step 4: (A/B)*n is approximately the exact line count.
Set n to a few different values, repeat, and average the results; a rough sketch is shown below.
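A minimal sketch of this estimation in shell (the file name and n are illustrative; `stat -c %s` is the GNU form, BSD/macOS uses `stat -f %z`):

```sh
file=bigfile.txt    # illustrative file name
n=5000              # number of sample lines

head -n "$n" "$file" > newfile    # Step 1: first n lines into newfile
A=$(stat -c %s "$file")           # Step 2: size of the huge file, in bytes
B=$(stat -c %s newfile)           # Step 3: size of the sample, in bytes
echo $(( A * n / B ))             # Step 4: ~ (A/B)*n; multiplying first keeps integer precision
```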

- This gives you an *approximation* of the number of lines in the file. You can't get an exact count without reading the whole file somehow. The estimate can be way off if the first **n** lines happen to be longer or shorter than average. And averaging the results for varying values of **n** seems odd. The largest **n** you try will include the results for all smaller values. Just doing a single measurement for some large **n** is likely to be better than the suggested averaging approach. In any case, the comments on my answer indicate that `wc -l` takes about 90 seconds. – Keith Thompson Mar 24 '17 at 23:12
The fastest approach is likely to be `wc -l`. The `wc` command is optimized to do exactly this kind of thing. It's very unlikely that anything else you can do (other than doing it on more powerful hardware) is going to be any faster.
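For example (the file name is illustrative; `time` just reports how long the count takes):

```sh
# Reading from stdin makes wc print only the number, without the file name.
time wc -l < bigfile.txt
```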
Yes, counting lines in a 5 gigabyte text file is slow. It's a big file.
The only alternative would be to store the data in some different format in the first place, perhaps a database, perhaps a file with fixed-length records. Converting your 5 gigabyte text file to some other format is going to take at least as long as running `wc -l` on it, but it might be worth it if you're going to be counting lines a lot. It's impossible to say what the tradeoffs are without more information.
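As an illustration of the fixed-length-record idea (the 128-byte record length and the file name are assumptions), counting then reduces to a single division of the file size:

```sh
# With fixed-length records, record count = file size / record length.
reclen=128                        # assumed record length in bytes
size=$(stat -c %s records.dat)    # GNU stat; use `stat -f %z` on BSD/macOS
echo $(( size / reclen ))
```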

- For that size of a plain text file, `wc` took a relatively short time on the first call for that file, and ~2 sec for later calls with the same file as input. – 0 _ Feb 01 '15 at 19:13
- Caching of the file the first time explains this; see the comment by @Ivella here: http://stackoverflow.com/a/12716620/1959808 – 0 _ Feb 01 '15 at 19:18
- After clearing the cache with `sudo sh -c "echo 3 > /proc/sys/vm/drop_caches"`, as suggested [here](http://unix.stackexchange.com/a/148442/43390), the time taken by `wc -l` on the first call was 1m31s for a 4.3G file. – 0 _ Feb 01 '15 at 19:30
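The cold-versus-warm cache difference described in these comments can be reproduced with something like the following sketch (Linux-specific; the file name is illustrative):

```sh
# Cold cache: drop the page cache (needs root), then time the count.
sudo sh -c "echo 3 > /proc/sys/vm/drop_caches"
time wc -l bigfile.txt    # first run reads from disk

# Warm cache: the file is now in the page cache, so the second run is much faster.
time wc -l bigfile.txt
```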