
I have a very large (multiple gigabytes) file that I want to do simple operations on:

  • Add 5-10 lines at the end of the file.
  • Add 2-3 lines at the beginning of the file.
  • Delete a few lines at the beginning, up to a certain substring. Specifically, I need to traverse the file up to a line that says "delete me!\n" and then delete all lines in the file up to and including that line.

I'm struggling to find a tool that can do the editing in place, without creating a temporary file (a very slow operation) that is essentially a copy of my original file. Basically, I want to minimize the number of I/O operations against the disk.

Both `sed -i` and `awk -i` do exactly that slow thing (https://askubuntu.com/questions/20414/find-and-replace-text-within-a-file-using-commands) and are inefficient as a result. What's a better way?

I'm on Debian.

Alex Weinstein
  • the `echo "blabla" >> bigfile` adds a line to the end. Deleting from the beginning isn't as easy. The easiest way is using a temp file, e.g. `-i` or `sed '....' file >newfile && mv newfile file` – clt60 Apr 27 '17 at 18:23
  • *Delete a few lines in the beginning, up to a certain substring* - can you elaborate? what substring and how many lines? – RomanPerekhrest Apr 27 '17 at 18:23
  • @RomanPerekhrest added explanation about it. – Alex Weinstein Apr 27 '17 at 18:30
  • 1) Should `delete me!` be present as separate words, or can it be part of something like `undelete me!`? 2) *Add a few lines in the end* - what lines and how many? – RomanPerekhrest Apr 27 '17 at 18:33
  • 1) delete me! should be a separate line by itself. 2) Add a few lines in the end - 2-3 lines total, that say "helloAlex" "helloRoman" "helloBob" – Alex Weinstein Apr 27 '17 at 18:40
  • Only `ed` does not use a temp file and it reads your whole file into a buffer first so I doubt it's what you want given your multi-gig files. The tool you want does not exist as you'd like it to but see http://stackoverflow.com/a/17331179/1745001 for how to remove lines from the start of a file. – Ed Morton Apr 27 '17 at 19:51
  • @AlexWeinstein, what you're asking for is for the most part literally impossible. Standard UNIX syscalls -- the interface used for userspace applications to request filesystem operations -- allow in-place appends to the **end** of a file; allow in-place edits where the original and new values are of the exact same length; but **don't** let you append data or delete data (in a way that changes overall file length) at any point but the end in a way that doesn't require rewriting the entire rest of the file. – Charles Duffy Apr 27 '17 at 19:59
  • ...I say "for the most part" because there are filesystems that go beyond the standard syscalls, but an answer that only works with a very specific filesystem is a pretty darned specific answer. – Charles Duffy Apr 27 '17 at 19:59
  • ...so, it's not just "can't do this with GNU tools", it's "no UNIX application using only standard APIs can do this *at all*, no matter what language it's written in". – Charles Duffy Apr 27 '17 at 20:00
  • Related: http://stackoverflow.com/questions/9033060/c-function-to-insert-text-at-particular-location-in-file-without-over-writing-th – Charles Duffy Apr 27 '17 at 20:06
  • Incidentally, this is a class of problem that's typically solved with indexed, log-structured file formats having deletion flags and the like. Which is to say, with databases. – Charles Duffy Apr 28 '17 at 01:10

2 Answers


Adding 5-10 lines at the beginning of a multi-GB file will always require fully rewriting the contents of that file, unless you're using an OS and filesystem that provides nonstandard syscalls. (You can avoid needing multiple GB of temporary space by buffering data ahead of your write position and writing it back into the region of the file you've already read, but you can't avoid rewriting everything past the point of the edit; a sketch of that approach follows.)
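
A rough, untested sketch of that write-back idea with GNU grep, dd, and truncate, assuming the marker line is exactly `delete me!` (11 bytes counting its newline) and occurs only once:

off=$(grep -b -m1 -x 'delete me!' bigfile | cut -d: -f1)  # byte offset where the marker line starts
cut=$(( off + 11 ))                                       # skip past the marker line itself
size=$(stat -c %s bigfile)
# slide the rest of the file to the front; the read position stays ahead
# of the write position, so one sequential pass over the same file is safe
dd if=bigfile of=bigfile bs=1M iflag=skip_bytes skip="$cut" conv=notrunc
truncate -s $(( size - cut )) bigfile                     # drop the now-duplicated tail

This still rewrites everything after the marker once, but it needs no multi-GB temporary copy on disk.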

This is because UNIX only permits adding new contents to a file, in a manner that changes its overall size, at or past its existing end. You can edit part of a file in place -- that is to say, you can seek 1GB in and write 1MB of new contents -- but this replaces the 1MB of contents that had previously been in that location; it doesn't change the total size of the file. Similarly, you can truncate a file at a location of your choice, but everything past the point of truncation needs to be rewritten.
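
To make that concrete, here is what the standard interface does let you do cheaply from the shell (patch.bin is a hypothetical file holding 1MB of replacement bytes; the hello lines are from the comments above):

printf 'helloAlex\nhelloRoman\nhelloBob\n' >> bigfile      # append: the one size-changing edit that is cheap
dd if=patch.bin of=bigfile bs=1M seek=1024 conv=notrunc    # overwrite 1MB at the 1GB mark; file size unchanged
truncate -s 1G bigfile                                     # cut the file at 1GB; everything past it is discarded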


Examples of the nonstandard operations referred to above are FALLOC_FL_INSERT_RANGE and FALLOC_FL_COLLAPSE_RANGE, which on very new Linux kernels allow blocks to be inserted into or removed from an existing file. This is unlikely to be helpful to you here:

  • Only whole blocks (i.e. 4KB -- whatever your filesystem is formatted for) can be inserted or removed, not individual lines of text of arbitrary size.
  • Only XFS and ext4 are supported.

See the documentation for fallocate(2).
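
If you are on a new enough kernel with ext4 or XFS, util-linux exposes these syscall flags through the fallocate(1) command, so a sketch of the "delete the beginning" step might look like the following (the 4096 block size is an assumption; verify yours first):

blk=4096                                                  # filesystem block size; check with: stat -f -c %S bigfile
off=$(grep -b -m1 -x 'delete me!' bigfile | cut -d: -f1)  # byte offset of the marker line
end=$(( off + 11 ))                                       # end of the marker line ("delete me!\n" is 11 bytes)
len=$(( end / blk * blk ))                                # collapse-range only accepts whole blocks
fallocate --collapse-range --offset 0 --length "$len" bigfile

Any leftover partial block (end % blk bytes) still sitting in front of the marker would have to be cleaned up separately.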

Charles Duffy

Here is a recommendation for editing large files (adjust the line count and the number of suffix digits based on your file's length and the number of sections to work on):

split -l 1000 -a 4 -d bigfile bigfile_

For that you need extra disk space, since bigfile won't be removed.

Insert a header as the first line:

sed -i '1iheader' bigfile_0000

Search for a specific pattern, get the file name, and remove the previous sections (see the sketch below):

grep pattern bigfile_*
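
For the question's specific "delete up to the marker" step, one hypothetical way to finish this off (assuming GNU sed, and that the marker line occurs exactly once):

marker=$(grep -l -x 'delete me!' bigfile_* | head -n 1)   # the chunk containing the marker line
for f in bigfile_*; do
  [ "$f" = "$marker" ] && break                           # stop at the marker's chunk
  rm -- "$f"                                              # earlier chunks are deleted wholesale
done
sed -i '0,/^delete me!$/d' "$marker"                      # GNU sed: delete through the marker, even if it's line 1

Only the one small chunk containing the marker gets rewritten; everything before it is removed without copying.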

etc.

Once all the editing is done, just cat the remaining pieces back together:

cat bigfile_* > edited_bigfile
karakfa
  • Very interesting direction... Follow-up question: is there a way to make the "split" command do no I/O on the content of the file? That is, to split the file in place? – Alex Weinstein Apr 27 '17 at 18:56
  • Nope, it has to locate the lines. – karakfa Apr 27 '17 at 19:10
  • @AlexWeinstein: If you can find a byte chunk size that is still small enough while not interfering with your string search, you can use `-b ` instead of `-l `, though I'm not sure if / to what extent that helps in terms of I/O. – mklement0 Apr 27 '17 at 19:37
  • @karakfa: Can you add an explanation to your answer as to why this approach helps with large files? It is not obvious to me, given that you first effectively create a copy of the entire original file, albeit in chunks. – mklement0 Apr 27 '17 at 19:40
  • This will help with editing the various pieces, especially if the tasks are performed iteratively. It eliminates scanning the whole file multiple times for each local edit. Otherwise, there is no magic.... – karakfa Apr 27 '17 at 19:42