6

Currently I am using sed to print the required portion of the file. For example, I used the command below:

sed -n 89001,89009p file.xyz

However, it is getting pretty slow as the file size increases (my file is currently 6.8 GB). I have tried to follow this link and used the command

sed -n '89001,89009{p;q}' file.xyz

But this command only prints the 89001st line. Kindly help me.

  • Have you considered reducing the size of the file by splitting it into more manageable chunks (maybe 1 GiB each, or maybe 10,000 lines each, or something like that)? As the file grows and you need to select lines nearer the end, the time taken to process the file will grow. If you select lines 1-10, then 11-20, then 21-30, etc., then you have a quadratic process, which is never going to be good for performance. – Jonathan Leffler Aug 27 '16 at 03:40
  • @mklement0: Yep, you're right, and thanks for that explanation of why it can't work. As there are 8 answers now on this Q, removing my incorrect comment ;-/ Good luck to all. – shellter Aug 29 '16 at 00:29

6 Answers

8

The syntax is a little bit different:

sed -n '89001,89009p;89009q' file.xyz
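If the line numbers vary between runs, the same command can be parameterized with shell variables (a minimal sketch; line1 and line2 are illustrative names, in the spirit of Jonathan Leffler's comment below):

line1=89001
line2=89009
sed -n "${line1},${line2}p; ${line2}q" file.xyz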

UPDATE:

Since there is also an answer with awk, I made a small comparison and, as I thought, sed is a little bit faster:

$ wc -l large-file 
100000000 large-file
$ du -h large-file 
954M    large-file
$ time sed -n '890000,890010p;890010q' large-file > /dev/null

real    0m0.141s
user    0m0.068s
sys 0m0.000s
$ time awk 'NR>=890000{print} NR==890010{exit}' large-file > /dev/null

real    0m0.433s
user    0m0.208s
sys     0m0.008s

UPDATE2:

There is a faster way with awk, as posted by @EdMorton, but it is still not as fast as sed:

$ time awk 'NR>=890000{print; if (NR==890010) exit}' large-file > /dev/null

real    0m0.252s
user    0m0.172s
sys     0m0.008s

UPDATE3:

This is the fastest way I was able to find (head and tail):

$ time head -890010 large-file| tail -10 > /dev/null

real    0m0.085s
user    0m0.024s
sys     0m0.016s
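As discussed in the comments below, whether time covers the whole pipeline or only the head command depends on the shell (bash's time keyword times the full pipeline); a shell-agnostic way to time it, taken from Ed Morton's comment, is:

time sh -c 'head -890010 large-file| tail -10' > /dev/null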
Dave Grabowski
  • You could use the second number, 89009, twice — which is simpler than having to add one to the second number of the range. That is, `sed -n "${line1},${line2}p; ${line2}q"` would work nicely where `line1=89001` and `line2=89009`, probably selected from command line arguments. – Jonathan Leffler Aug 27 '16 at 03:32
  • Dear @JonathanLeffler, as you said, since I am trying to access the last parts of the file, it is taking a lot of time. So, other than splitting the file (maybe using a command like split), is there any other efficient manner (for example, using any other commands)? – Sharma SRK Chaitanya Yamijala Aug 27 '16 at 04:46
  • @SharmaSRKChaitanyaYamijala: There's an element of [XY Problem](http://mywiki.wooledge.org/XyProblem) here — we know how you're trying to resolve a problem, but we know it isn't an ideal solution. However, there is something else that you're trying to achieve. Maybe it is 'for every 10 records that arrive, create a file that can be processed and launch the processing command' — in which case, you use a continuously running process to do the splitting which can keep its current position in the file and act accordingly. This command doesn't have to reread all 6 GiB when new lines arrive. – Jonathan Leffler Aug 27 '16 at 04:53
  • @SharmaSRKChaitanyaYamijala: Note that `tail -f` might be part of the answer — or you might use code that simulates `tail -f`. Also, your specialized command could keep track of byte offsets as well as line numbers. Byte offsets could be handy if you need to resume after a stoppage. Just some thoughts… – Jonathan Leffler Aug 27 '16 at 04:58
  • @DawidGrabowski Please test the speed of `awk 'NR>=890000{print; if (NR==890010) exit}' large-file` (see [my answer](http://stackoverflow.com/a/39181709/1745001)) - that should be significantly faster than the awk script you already tested since it's not redundantly testing the second condition for the first 889999 lines, just the 10 lines starting at 890000. btw - hopefully you're reporting the 3rd execution time for each command to remove caching from the equation. – Ed Morton Aug 27 '16 at 13:42
  • @EdMorton That's better but still slower than sed – Dave Grabowski Aug 27 '16 at 22:25
  • When I ran the same test I got sed `real 0m0.265s; user 0m0.233s; sys 0m0.015s` and awk `real 0m0.234s; user 0m0.187s; sys 0m0.030s`, so in my case awk is slightly faster. It probably just depends on the versions of sed and awk you are running. – Ed Morton Aug 28 '16 at 00:30
  • @EdMorton I posted another answer with head and tail. Can you try with that? – Dave Grabowski Aug 28 '16 at 21:41
  • You're only running `time` on the `head` command, you need to do `time sh -c 'head -890010 large-file| tail -10' > /dev/null` to time the full pipeline. The output I get is `real 0m0.281s; user 0m0.138s; sys 0m0.077s` so a bit slower than sed or awk. – Ed Morton Aug 29 '16 at 03:10
4
awk 'NR>=89001{print; if (NR==89009) exit}' file.xyz
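If the range varies, the same idea can be parameterized via awk's -v option (a sketch, not part of the original answer; start, end, s and e are illustrative names):

start=89001
end=89009
awk -v s="$start" -v e="$end" 'NR>=s{print; if (NR==e) exit}' file.xyz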
Ed Morton
  • ++; You've explained why your answer performs better than @karakfa's in the comments on the accepted answer, but I suggest doing so directly in your answer as well. – mklement0 Aug 27 '16 at 15:56
  • @mklement0 This is still slower than sed. – Dave Grabowski Aug 27 '16 at 22:55
  • @DawidGrabowski: That probably depends on the specific `sed` and `awk` implementations you're using for the comparison (for instance, on my OSX 10.11 system, both `gawk` and `mawk` are faster than my BSD `sed`), but among _Awk_ solutions, this optimization is well worth making. – mklement0 Aug 27 '16 at 23:04
  • @mklement0 I posted another answer with head and tail. Can you try that one on your machine? It's faster than sed and awk for me. – Dave Grabowski Aug 28 '16 at 21:40
3

Dawid Grabowski's helpful answer is the way to go (with sed[1]; Ed Morton's helpful answer is a viable awk alternative; a tail+head combination will typically be the fastest[2]).

As for why your approach didn't work:

A two-address expression such as 89001,89009 selects an inclusive range of lines, bounded by the start and end address (line numbers, in this case).

The associated function list, {p;q;}, is then executed for each line in the selected range.

Thus, line # 89001 is the 1st line that causes the function list to be executed: right after printing (p) the line, function q is executed - which quits execution right away, without processing any further lines.

To prevent premature quitting, Dawid's answer therefore separates the aspect of printing (p) all lines in the range from quitting (q) processing, using two commands separated with ;:

  • 89001,89009p prints all lines in the range
  • 89009q quits processing when the range's end point is reached

[1] A slightly less repetitive reformulation that should perform equally well ($ represents the last line, which is never reached due to the 2nd command):
sed -n '89001,$ p; 89009 q'

[2] A better reformulation of the head + tail solution from Dawid's answer is
tail -n +89001 file | head -n 9, because it caps the number of bytes that are not of interest yet are still sent through the pipe at the pipe-buffer size (a typical pipe-buffer size is 64 KB).
With GNU utilities (Linux), this is the fastest solution, but on OSX with stock utilities (BSD), the sed solution is fastest.
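To make [2] concrete, a small helper that prints an inclusive line range via tail + head could look like this (a sketch assuming bash; the function name print_range is made up for illustration):

# Print lines $2 through $3 (inclusive) of file $1.
print_range() {
  local file=$1 start=$2 end=$3
  tail -n +"$start" "$file" | head -n "$((end - start + 1))"
}
print_range file.xyz 89001 89009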

mklement0
2

Easier to read in awk; performance should be similar to sed:

awk 'NR>=89001{print} NR==89009{exit}' file.xyz

You can also drop the {print} action and just terminate the first condition with a semicolon, since printing is awk's default action.
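That is, a shortened form equivalent to the command above:

awk 'NR>=89001; NR==89009{exit}' file.xyz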

karakfa
  • Yes, interesting! Since there is no parsing, I would have expected the underlying code paths to be very similar. With sed start,end I guess it skips the lines until start; awk needs to compare each line starting at 1. – karakfa Aug 27 '16 at 04:55
0

Another way to do it is to use a combination of head and tail:

$ time head -890010 large-file| tail -10 > /dev/null

real    0m0.085s
user    0m0.024s
sys     0m0.016s

This is faster than sed and awk.
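Reversing the order of the two commands, as suggested in the comments below, sends only the lines from the start of the range onward into the pipe, and head stops reading after 10 lines, so at most a pipe-buffer's worth of extra data is ever transferred:

tail -n +890001 large-file | head -n 10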

Dave Grabowski
  • I suggest `tail -n +890001 large-file | head -n 10` instead, which not only makes conceptually more sense to me, but also seems to be faster with higher line numbers. Aside from that: on OSX, with stock utilities (BSD heritage), `sed` (but not `awk`) outperforms the `tail` + `head` combo, but with _GNU_ utilities (at least when run on OSX), `tail` + `head` is by far the fastest solution, beating `mawk`, `gawk` and `sed` by a wide margin, with `mawk` noticeably faster than `gawk`, followed closely by `sed`. Based on 1-million lines file with `line ` lines, extracting 10 lines from #890001. – mklement0 Aug 28 '16 at 23:51
  • You're only running `time` on the `head` command, you need to do `time sh -c 'head -890010 large-file| tail -10' > /dev/null` to time the full pipeline. The output I get is `real 0m0.281s; user 0m0.138s; sys 0m0.077s` so a bit slower than sed or awk. @mklement0 is right, reversing the order of the commands will be faster, depending on whether you want to select from before or after the middle of the file. – Ed Morton Aug 29 '16 at 03:12
  • @EdMorton: `time` actually _does_ time full pipelines (in `bash`, `ksh`, `zsh`; there, `time` is a shell _keyword_, not a _builtin_, which allows it to act that way - see http://mywiki.wooledge.org/BashFAQ/032); your `time sh -c '...'` workaround is only needed in shells that do not provide a built-in `time`, such as in `dash`. – mklement0 Aug 29 '16 at 15:23
  • Interesting, I get a significant difference when running the 2 commands lines with bash on cygwin, IMHO not explainable by the overhead of `sh -c`. Oh well... thanks for the info. – Ed Morton Aug 29 '16 at 16:11
  • @EdMorton: Re preferring `tail -n +... | head ...`: It's not about the _midpoint_ of the input file, but the size of the _pipe buffer_: `tail -n +...` sends at most a pipe buffer-sized chunk of data through the pipe before `head` acts on it, whereas the `head`-first approach unconditionally sends the specified number of lines - whatever their combined size - given that `tail` must read them all before it can determine what the _last_ N lines are. `head ... | tail ...` is only faster if the count of bytes before the first line of interest is smaller than the pipe buffer (typically 64 KB). – mklement0 Aug 29 '16 at 21:02
-2

sed has to search from the beginning of the file to find the Nth line. To make things faster, divide the large file into fixed-size line intervals using an index file. Then use dd to skip the early portion of the big file before feeding it to sed.

Build the index file using:

#!/bin/bash
# Build an index file containing the cumulative byte offset after every
# INTERVAL lines of the large file.
# NB: dd with bs=1 reads a byte at a time, so building the index is slow
# (see the comments below), but it only has to be done once per file.

INTERVAL=1000
LARGE_FILE="big-many-GB-file"
INDEX_FILE="index"

LASTSTONE=123
MILESTONE=0

echo $MILESTONE > "$INDEX_FILE"

while [ $MILESTONE != $LASTSTONE ]; do
    LASTSTONE=$MILESTONE
    # Byte count of the next INTERVAL lines, starting at offset LASTSTONE...
    MILESTONE=$(dd if="$LARGE_FILE" bs=1 skip=$LASTSTONE 2>/dev/null | head -n$INTERVAL | wc -c)
    # ...added to the running offset; at EOF it stops growing and the loop ends.
    MILESTONE=$(($LASTSTONE+$MILESTONE))
    echo $MILESTONE >> "$INDEX_FILE"
done

exit

Then search for a line using: ./this_script.sh 89001

#!/bin/bash
# Print line number $1 of the large file, using the index built above.

INTERVAL=1000
LARGE_FILE="big-many-GB-file"
INDEX_FILE="index"

LN=$(($1-1))

# Byte offset of the INTERVAL-sized block that contains the requested line.
OFFSET=$(head -n$((1+($LN/$INTERVAL))) "$INDEX_FILE" | tail -n1)
# Line number of the requested line relative to that offset.
LN=$(($LN-(($LN/$INTERVAL)*$INTERVAL)))
LN=$(($LN+1))
dd if="$LARGE_FILE" bs=1 skip=$OFFSET 2>/dev/null | sed -n "$LN"p
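The script above prints a single line; to print a range such as 89001-89009 (as in the original question), the same index lookup can feed a range to sed. A hypothetical extension under the same assumptions (same INTERVAL and index file; the arithmetic mirrors the script above):

#!/bin/bash
# Hypothetical sketch (not part of the original answer): print lines START
# through END of the large file, using the index built earlier.
INTERVAL=1000
LARGE_FILE="big-many-GB-file"
INDEX_FILE="index"
START=89001
END=89009

LN=$((START-1))
# Byte offset of the INTERVAL-sized block that contains START.
OFFSET=$(head -n$((1+(LN/INTERVAL))) "$INDEX_FILE" | tail -n1)
# Start and end line numbers relative to that offset.
FIRST=$((LN % INTERVAL + 1))
LAST=$((FIRST + END - START))
dd if="$LARGE_FILE" bs=1 skip="$OFFSET" 2>/dev/null | sed -n "${FIRST},${LAST}p;${LAST}q"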
ronybc
  • My first thing at http://stackoverflow.com (registered today). I didn't try this on a multi-GB text file. Tested and confirmed that 'dd' skips without reading data. – ronybc Aug 27 '16 at 16:26
  • dd skips a number of bytes. You cannot skip a number of lines, so your code will not work. – Dave Grabowski Aug 28 '16 at 00:18
  • dd skips some number of bytes and gives the rest to sed. – ronybc Aug 28 '16 at 04:21
  • But you don't know how many bytes to skip if you want to start from, e.g., line 89000. – Dave Grabowski Aug 28 '16 at 21:38
  • It puts milestones every 10000 lines into an index file. – ronybc Aug 29 '16 at 17:20
  • And that will surely speed this up if you are troubled with a many-GB text file. Make a 50 GB text file and search for the last line; time it with an hourglass..! – ronybc Aug 29 '16 at 17:25
  • The first script marks, every 10000 lines, the byte offset where that block ends, into a table. That way, the second script can jump straight to the block where the line you are looking for is sitting. – ronybc Aug 29 '16 at 17:36
  • Last line? Just run tail -n1. It will be done in almost 0s. tail does not read the entire file - it starts backwards. BTW, have you ever tried to run your code with a bigger file? Creating an index file will take forever for sure. – Dave Grabowski Aug 29 '16 at 17:38
  • it starts backwards... towards where..? – ronybc Aug 29 '16 at 19:03
  • It goes backwards until the expected number of lines has been reached. [Here](http://git.savannah.gnu.org/cgit/coreutils.git/tree/src/tail.c#n471) is the implementation. – Dave Grabowski Aug 29 '16 at 20:42
  • Looks like I failed to explain the idea... I request you to read the original post again. dd, head and tail are well-established tools... it looks like we are looking in different directions, and I'm fed up with it. Sorry bro... let's stop it here. Cheers... – ronybc Aug 29 '16 at 23:00
  • If there are repeated searches over the same big text file (and surely there will be), this is a shortcut to get around the many GBs quickly... using dd, head, tail and sed at last, to save time. – ronybc Aug 29 '16 at 23:24