11

I'm trying to remove the first 37 lines from a very, very large file. I started by trying sed and awk, but they seem to require copying the data to a new file. I'm looking for a "remove lines in place" method that, unlike sed -i, doesn't make copies of any kind, but just removes lines from the existing file.

Here's what I've done...

awk 'NR > 37' file.xml > 'f2.xml'
sed -i '1,37d' file.xml

Both of these seem to do a full copy. Is there any other simple CLI that can do this quickly without a full document traversal?

Mittenchops
  • Both `sed -i` and `gawk v4.1 -i inplace` basically create a temp file behind the scenes. IMO `sed` should be faster than `tail` and `awk`. – jaypal singh Jun 26 '13 at 21:04
  • `tail` is many times faster than `sed` or `awk` for this task (though of course it doesn't fit this question's requirement for true in-place editing). – thanasisp Sep 22 '20 at 21:30

4 Answers

14

There's no simple way to do in-place editing using UNIX utilities, but here's one in-place file modification solution that you might be able to adapt to work for you (courtesy of Robert Bonomi at https://groups.google.com/forum/#!topic/comp.unix.shell/5PRRZIP0v64):

bytes=$(head -37 "$file" |wc -c)
dd if="$file" bs="$bytes" skip=1 conv=notrunc of="$file"

The final file should be $bytes bytes smaller than the original (since the goal was to remove $bytes bytes from the beginning), so to finish we must remove the final $bytes bytes. We're using conv=notrunc above so that dd doesn't truncate the output file when it opens it; since the input and output are the same file, that truncation would empty the file entirely before anything could be copied (see below for an example). On a GNU system such as Linux, the truncation afterwards can be accomplished with:

truncate -s "-$bytes" "$file"

For example to delete the first 5 lines from this 12-line file

$ wc -l file
12 file

$ cat file
When chapman billies leave the street,
And drouthy neibors, neibors, meet;
As market days are wearing late,
And folk begin to tak the gate,
While we sit bousing at the nappy,
An' getting fou and unco happy,
We think na on the lang Scots miles,
The mosses, waters, slaps and stiles,
That lie between us and our hame,
Where sits our sulky, sullen dame,
Gathering her brows like gathering storm,
Nursing her wrath to keep it warm.

First use dd to remove the target 5 lines (really "$bytes" bytes) from the start of the file by shifting the rest of the file toward the front; this leaves a stale copy of the trailing "$bytes" bytes at the end:

$ bytes=$(head -5 file |wc -c)

$ dd if=file bs="$bytes" skip=1 conv=notrunc of=file
1+1 records in
1+1 records out
253 bytes copied, 0.0038458 s, 65.8 kB/s

$ wc -l file
12 file

$ cat file
An' getting fou and unco happy,
We think na on the lang Scots miles,
The mosses, waters, slaps and stiles,
That lie between us and our hame,
Where sits our sulky, sullen dame,
Gathering her brows like gathering storm,
Nursing her wrath to keep it warm.
s, waters, slaps and stiles,
That lie between us and our hame,
Where sits our sulky, sullen dame,
Gathering her brows like gathering storm,
Nursing her wrath to keep it warm.

and then use truncate to remove those leftover bytes from the end:

$ truncate -s "-$bytes" "file"

$ wc -l file
7 file

$ cat file
An' getting fou and unco happy,
We think na on the lang Scots miles,
The mosses, waters, slaps and stiles,
That lie between us and our hame,
Where sits our sulky, sullen dame,
Gathering her brows like gathering storm,
Nursing her wrath to keep it warm.

For comparison, if we had tried the above without conv=notrunc on the dd command:

$ wc -l file
12 file
$ bytes=$(head -5 file |wc -c)
$ dd if=file bs="$bytes" skip=1 of=file
dd: file: cannot skip to specified offset
0+0 records in
0+0 records out
0 bytes copied, 0.0042254 s, 0.0 kB/s
$ wc -l file
0 file

See the Google Groups thread referenced above for other suggestions and info.
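
For convenience, the two steps could be wrapped into a small script along these lines (a sketch, not part of the original answer; it assumes bash and GNU coreutils, and the script name and arguments are made up for illustration):

#!/usr/bin/env bash
# remove_head_inplace.sh FILE [NLINES] - drop the first NLINES lines of FILE in place
# using the head/dd/truncate steps shown above. Assumes GNU dd and truncate.
set -euo pipefail
file=$1
lines=${2:-37}

# Byte length of the lines to drop.
bytes=$(head -n "$lines" "$file" | wc -c)

# Shift the remainder of the file to the front, without truncating the output on open...
dd if="$file" bs="$bytes" skip=1 conv=notrunc of="$file"

# ...then chop the now-stale copy of the tail off the end.
truncate -s "-$bytes" "$file"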

Ed Morton
  • On linux, you'll want to use `conv=notrunc` in `dd`, otherwise the command fails. `+1`. – gniourf_gniourf Jun 26 '13 at 22:23
  • I think this may have worked in that my file looked OK, but it seemed to also do additional writing that I terminated. So, I wrote a file named cutter.sh, which contained ```#!/bin/bash file=enwiki-latest-pages-articles.xml count=`head -37 "$file" |wc -c` dd if="$file" bs="$count" skip=1 of="$file" conv=notrunc``` – Mittenchops Jun 27 '13 at 00:00
  • It ran for a very long time, then when I C-c to start over again, ended with: `^C1223734+0 records in 1223734+0 records out 2902697048 bytes (2.9 GB) copied, 59.699 s, 48.6 MB/s ` However, my data /looks/ fine. Can I trust its integrity from having cut it off? It doesn't seem like 2.9GB needed to be copied for 37 short lines of data. – Mittenchops Jun 27 '13 at 00:01
  • Best go ask at the comp.unix.shell newsgroup where all the shell experts hang out. – Ed Morton Jun 27 '13 at 01:59
  • This is mentioned in the Google Groups thread, but never spelled out: you have to trim the final `$count` bytes from the end of the file when you're done. I've edited your answer to reflect this so future readers have a complete solution. – jasonmp85 Jan 27 '14 at 07:46
  • `truncate` removes the **last** n bytes if the specified size is smaller than the actual file size, while the question is about the **first** n lines – vladkras Nov 14 '18 at 04:29
  • @vladkras right. `dd` removes the first N bytes but leaves N bytes worth of unwanted text at the end of the file so then we use `truncate` to remove that unwanted trailing text. I just updated the question to show a complete example. – Ed Morton May 14 '19 at 04:18
6

Unix file semantics do not allow truncating the front part of a file.

All solutions will be based on either:

  1. Reading the file into memory and then writing it back (ed, ex, other editors). This should be fine if your file is <1GB or if you have plenty of RAM.
  2. Writing a second copy and optionally replacing the original (sed -i, awk/tail > foo). This is fine as long as you have enough free diskspace for a copy, and don't mind the wait.

If the file is too large for any of these to work for you, you may be able to work around it depending on what's reading your file.

Perhaps your reader skips comments or blank lines? If so, you can craft a message the reader ignores, make sure it has the same number of bytes as the first 37 lines in your file, and overwrite the start of the file with dd if=yourdata of=file conv=notrunc.
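
As a rough illustration of that last idea (a sketch, assuming the consumer really does ignore blank lines; file.xml, the line count, and the temp file are placeholders):

file=file.xml
bytes=$(head -n 37 "$file" | wc -c)

# Build a filler of exactly $bytes bytes consisting only of newlines...
filler=$(mktemp)
head -c "$bytes" /dev/zero | tr '\0' '\n' > "$filler"

# ...and write it over the first $bytes bytes of the file, in place.
dd if="$filler" of="$file" conv=notrunc
rm -f "$filler"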

that other guy
  • Hmm, hadn't thought of that. If I were to do this at the time of bunzip2-ing the file---you're saying I would pipe the unzip to awk and that to the outfile? So, would that be something like `bunzip2 filename.xml.bz2 | awk 'NR > 37' filename.xml` – Mittenchops Jun 26 '13 at 21:30
  • yep, doing that when unzipping would also just stream the copy and write to disk only the altered file. – Peteris Jun 26 '13 at 23:03
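
For what it's worth, a sketch of that streaming variant (assuming bunzip2's -c flag, which decompresses to stdout and keeps the archive; the filenames follow the comment above and are only illustrative):

bunzip2 -c filename.xml.bz2 | awk 'NR > 37' > filename.xml

Using tail, as suggested in the comments under the question, the filter would be tail -n +38 instead of the awk command.
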
5

`ed` is the standard editor:

ed -s file <<< $'1,37d\nwq'
gniourf_gniourf
  • That is using a buffer, no better than a temp file. – Ed Morton Jun 26 '13 at 21:11
  • +1 this was fast. File with 1M entries: `$ time ed -s ff <<< $'1,37d\nwq'` real 0m0.251s user 0m0.219s sys 0m0.032s; `$ time sed -i '1,37d' ff` real 0m1.415s user 0m0.399s sys 0m1.016s – jaypal singh Jun 26 '13 at 21:12
  • @EdMorton of course, that's what an editor does `:)` yet, it might be faster than `sed` or `awk`... – gniourf_gniourf Jun 26 '13 at 21:15
  • I'm about to test this after my file re-unzips. The record is about 9GB, so I hope it's not buffering. =) – Mittenchops Jun 26 '13 at 21:32
  • The OP seems to be looking for a solution that truly does in-place editing though, not one that uses a temp file/buffer; otherwise he may as well just use sed or awk. – Ed Morton Jun 26 '13 at 21:46
2

A copy will have to be created at some point, so why not create it at the time the "modified" file is read, streaming the altered copy instead of storing it?

What I'm thinking: create a named pipe "file2" that is fed by that same awk 'NR > 37' file.xml (or whatever); then whoever reads file2 will not see the first 37 lines.

The drawback is that it will run awk each time the file is processed, so it's feasible only if it's read rarely.
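
Roughly, that could look like this (a sketch; some_consumer stands in for whatever program reads the data, and the awk writer has to be relaunched for every read):

mkfifo file2
awk 'NR > 37' file.xml > file2 &   # the writer blocks until a reader opens the pipe
some_consumer file2                # the reader sees everything after line 37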

Peteris