23

I have to fetch one specific line out of a big file (1,500,000 lines), multiple times in a loop over multiple files, and I was asking myself what would be the best option (in terms of performance). There are many ways to do this; I mainly use these two:

cat ${file} | head -1

or

cat ${file} | sed -n '1p'

I could not find an answer to this: do they both fetch only the first line, or does one of the two (or both) first open the whole file and then fetch row 1?

Chris Seymour
JBoy
  • 2
    Use `time` to measure the commands. – choroba Mar 26 '13 at 08:47
  • 5
    Why pipe `cat` into the tools? They can both open files themselves, and if you are worried about efficiency, they can probably do it better. But, yes, the pipe should "stream" just the first few blocks of the file (and then notice that the consumer stopped caring). – Thilo Mar 26 '13 at 08:48
  • BTW, for a specific line far into a large file, it's highly likely a program in an ahead-of-time compiled language could run even faster than `head "-$pos" "$file" | tail -1`. (Like C, especially with SIMD intrinsics to optimize the counting of newlines over large blocks of memory until you get close to the right starting place. It should be limited only by memory bandwidth after mmaping the file, if already hot in the page-cache.) – Peter Cordes Aug 29 '20 at 04:53

6 Answers

40

Drop the useless use of cat and do:

$ sed -n '1{p;q}' file

This will quit the sed script after the line has been printed.
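
The same pattern works for any single line, not just the first; a minimal sketch, with `n` as an assumed variable holding the target line number:

n=1500
sed -n "${n}{p;q}" file    # print only line $n, then quit so sed stops reading the rest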


Benchmarking script:

#!/bin/bash

TIMEFORMAT='%3R'
n=25
heading=('head -1 file' 'sed -n 1p file' "sed -n '1{p;q}' file" 'read line < file && echo $line')

# files up to a hundred million lines (if you're on a slow machine, decrease!!)
for (( j=1; j<=100000000; j=j*10 ))
do
    echo "Lines in file: $j"
    # create file containing j lines
    seq 1 $j > file
    # initial read of file
    cat file > /dev/null

    for comm in {0..3}
    do
        avg=0
        echo
        echo ${heading[$comm]}    
        for (( i=1; i<=$n; i++ ))
        do
            case $comm in
                0)
                    t=$( { time head -1 file > /dev/null; } 2>&1);;
                1)
                    t=$( { time sed -n 1p file > /dev/null; } 2>&1);;
                2)
                    t=$( { time sed -n '1{p;q}' file > /dev/null; } 2>&1);;
                3)
                    t=$( { time read line < file && echo $line > /dev/null; } 2>&1);;
            esac
            avg=$avg+$t
        done
        echo "scale=3;($avg)/$n" | bc
    done
done

Just save as benchmark.sh and run bash benchmark.sh.

Results:

head -1 file
.001

sed -n 1p file
.048

sed -n '1{p;q}' file
.002

read line < file && echo $line
0

*Results from a file with 1,000,000 lines.*

So the times for sed -n 1p will grow linearly with the length of the file but the timing for the other variations will be constant (and negligible) as they all quit after reading the first line:

[Graph: average run time of each command vs. number of lines in the file]

Note: timings are different from original post due to being on a faster Linux box.

Chris Seymour
  • 3
    Or perhaps `sed 1q file` which is a little less busy. – potong Mar 26 '13 at 09:11
  • @potong I used this format so it can be used to print any single line in the file. – Chris Seymour Mar 26 '13 at 09:13
  • 1
    Ideally you should recreate the file each time. Depending on the filesystem, caching can affect timings such that the first run does the real I/O and subsequent runs benefit. – cdarke Mar 26 '13 at 10:49
  • 1
    +1 for the detailed performance comparison. btw, in your script, the sed line (`sed 1q`) in `case` and `heading` are different. :) it would be good to make them same particularly for performance testing. anyway, nice answer! – Kent Mar 26 '13 at 11:38
  • @Kent good spot, slipped through as I was testing and updating. Also added a nice graph! – Chris Seymour Mar 26 '13 at 11:45
  • `j<=100,000,000` looks to be invalid Bash syntax – Roel Van de Paar Aug 29 '20 at 02:44
5

If you are really just getting the very first line and reading hundreds of files, then consider shell builtins instead of external commands: use read, which is a builtin in bash and ksh. This eliminates the overhead of process creation with awk, sed, head, etc.
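
A minimal sketch of that idea, assuming bash and that `file` is a shell variable holding the filename:

read -r line < "$file"    # builtin: no child process, and only the first line is read
printf '%s\n' "$line"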

The other issue is doing timed performance analysis on I/O. The first time you open and then read a file, the file data is probably not cached in memory. However, if you try a second command on the same file, the data as well as the inode have been cached, so the timed results may be faster, pretty much regardless of the command you use. Plus, inodes can stay cached practically forever (they do on Solaris, for example) or at least for several days.

For example, Linux caches everything and the kitchen sink, which is a good performance attribute, but it makes benchmarking problematic if you are not aware of the issue.

All of this caching effect "interference" is both OS and hardware dependent.

So: pick one file and read it with a command; now it is cached. Then run the same test command several dozen times. This samples the effect of the command and child-process creation, not your I/O hardware.
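
A rough sketch of that procedure, with an arbitrary iteration count and sed standing in for whichever command is being tested:

cat file > /dev/null           # read the file once so its data is cached
time for i in {1..50}; do      # now mostly command and process-creation cost is measured
    sed -n '1{p;q}' file > /dev/null
done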

This is sed vs. read for 10 iterations of getting the first line of the same file, after reading the file once:

sed: sed '1{p;q}' uopgenl20121216.lis

real    0m0.917s
user    0m0.258s
sys     0m0.492s

read: read foo < uopgenl20121216.lis ; export foo; echo "$foo"

real    0m0.017s
user    0m0.000s
sys     0m0.015s

This is clearly contrived, but does show the difference between builtin performance and using an external command.

jim mcnamara
5

If you want to print only 1 line (say the 20th one) from a large file you could also do:

head -20 filename | tail -1
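
More generally, a sketch with `n` as an assumed variable for the target line number:

n=20
head -"$n" filename | tail -1    # head stops after $n lines; tail keeps only the last of them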

I did a "basic" test with bash and it seems to perform better than the sed -n '1{p;q}' solution above.

The test takes a large file and prints a line from somewhere in the middle (line 10000000), repeating 100 times and selecting the next line each time. So it selects lines 10000000, 10000001, 10000002, ... and so on up to 10000099.

$wc -l english
36374448 english

$time for i in {0..99}; do j=$((i+10000000));  sed -n $j'{p;q}' english >/dev/null; done;

real    1m27.207s
user    1m20.712s
sys     0m6.284s

vs.

$time for i in {0..99}; do j=$((i+10000000));  head -$j english | tail -1 >/dev/null; done;

real    1m3.796s
user    0m59.356s
sys     0m32.376s

For printing a line out of multiple files

$wc -l english*
  36374448 english
  17797377 english.1024MB
   3461885 english.200MB
  57633710 total

$time for i in english*; do sed -n '10000000{p;q}' $i >/dev/null; done; 

real    0m2.059s
user    0m1.904s
sys     0m0.144s



$time for i in english*; do head -10000000 $i | tail -1 >/dev/null; done;

real    0m1.535s
user    0m1.420s
sys     0m0.788s
dvvrt
  • 1
    A single `sed` invocation is slightly faster for low line positions, like `i + 1000`. See [@roel's answer](https://stackoverflow.com/a/63643068/224132) and my comments: I can repro very similar results to yours for large line positions like 100k, and also confirm Roel's result that for shorter counts, `sed` alone is better. (And for me, on i7-6700k desktop Skylake, head|tail is even better than for you, bigger relative speedup for large n. Probably better inter-core bandwidth than the system you tested on so piping all that data costs less.) – Peter Cordes Aug 29 '20 at 03:56
3

How about avoiding pipes? Both sed and head support the filename as an argument. That way you avoid passing the data through cat. I didn't measure it, but head should be faster on larger files, as it stops the computation after N lines (whereas sed goes through all of them, even if it doesn't print them, unless you specify the quit option as suggested above).

Examples:

sed -n '1{p;q}' /path/to/file
head -n 1 /path/to/file

Again, I didn't test the efficiency.

Jens
0

I have done extensive testing, and found that, if you want every line of a file:

while IFS=$'\n' read LINE; do
  echo "$LINE"
done < your_input.txt

is much, much faster than any other (Bash-based) method out there. All other methods (like sed) read the file each time, at least up to the matching line. If the file is 4 lines long, you get: 1 -> 1,2 -> 1,2,3 -> 1,2,3,4 = 10 reads, whereas the while loop just maintains a position cursor (based on IFS) and so does only 4 reads in total.

On a file with ~15k lines, the difference is phenomenal: ~25-28 seconds (sed based, extracting a specific line each time) versus ~0-1 seconds (while...read based, reading through the file once).

The above example also shows how to better set IFS to a newline (with thanks to Peter from the comments below), and this will hopefully fix some of the other issues sometimes seen when using while ... read ... in Bash.
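
If you only need one specific line, a counter can be bolted onto the same loop (as also discussed in the comments below); a sketch, with `target` as an assumed variable for the wanted line number:

target=1000
count=0
while IFS=$'\n' read -r LINE; do
  count=$((count+1))
  if [ "$count" -eq "$target" ]; then
    printf '%s\n' "$LINE"    # printf avoids echo's special handling of lines like "-e"
    break                    # stop reading once the line has been printed
  fi
done < your_input.txt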

Peter Cordes
Roel Van de Paar
  • `echo $line` should be `echo "$line"` to avoid word-splitting. Or better, `printf "%s" "$line"` to be safe even with lines like `-e`. And yes, I think you want `(IFS=$'\n'; read line; printf "%s" "$line")`, although that forks a subshell so you might instead just use override IFS for `read` alone, if `IFS=$'\n' read line < file` works without having to save/restore the IFS shell variable. – Peter Cordes Aug 29 '20 at 03:39
  • Thank you for the input Peter! This got me to test further and I found something very interesting, which also logically makes sense. Ref above. – Roel Van de Paar Aug 29 '20 at 04:55
  • Now you're printing the whole file (except for lines like `"-e"`, which echo will eat or throw an error on), so your loop can be replaced with `cat "$file"`, which in turn is much faster than a `bash` read loop. This question was about extracting a *single* line, the implication being that you *don't* want it in a loop repeating for every line in order. If you do just want to run some bash commands (i.e. a different loop body) for every line of an input file or stream, yes of course you'd do this. – Peter Cordes Aug 29 '20 at 05:33
  • But it's unlikely to be the fastest way to get *just* the 100k'th line from a large file, which is what other answers are attempting to do efficiently. – Peter Cordes Aug 29 '20 at 05:34
  • Yes, that's what I said. This *is* the fastest way to process *every* line, but that's a different problem from what the question is asking (and from what the other answers are answering). They're only using repeat-loops over sed or head|tail to get times long enough to measure, not because they actually want a range of lines. Your answer belongs on [Looping through the content of a file in Bash](https://stackoverflow.com/q/1521462), except that it's already answered with a `while read` loop. (And using a safe printf instead of an unsafe echo as the body). – Peter Cordes Aug 29 '20 at 05:53
  • @PeterCordes One could add a counter to grab the specific line one wants. However, if you need to process all lines from a file a single line at a time (the usual use case) then this is the fastest method by far. The OP may have also meant this with "multiple times in a loop", though it is not perfectly clear. The change here is indeed in context, but my answer proposed to look at exactly that context. If on the other hand it is 'just the xth line from a large file', then it MAY still be the fastest method as well. It makes sense as it is bash-native and avoids external tools loading. – Roel Van de Paar Aug 29 '20 at 05:55
  • If you want to propose this pure-bash loop as an efficient way to extract a single line, benchmark it *for that use case*. It's probably great for the first line, likely loses for the 100k'th line. I'd guess the break-even point for this vs. `sed` might be something like the 1000th or 10 000th line. Note that if you want a *range* of lines, `head -$start | tail -$len | while read; ... ;` will do it. – Peter Cordes Aug 29 '20 at 05:56
  • It was already done above by Chris, and it proved to be the fastest. – Roel Van de Paar Aug 29 '20 at 05:57
  • If you mean the top answer, that was for reading the *first* line only, not an arbitrary later line. Bash is generally known to be not super fast for text processing, but of course it's faster for just the first line vs. forking `head -1` for it. (Note that you don't need tail for that case) – Peter Cordes Aug 29 '20 at 05:59
0

For the sake of completeness, you can also use the basic Linux command cut:

cut -d $'\n' -f <linenumber> <filename>
abu_bua