21

Possible Duplicate:
Removing the first line of a text file in C#

What would be the fastest and smartest way to remove the first line from a huge (think 2-3 GB) file?

  • I think you probably can't avoid rewriting the whole file chunk-by-chunk, but I might be wrong.

  • Could using memory-mapped files somehow help solve this issue?

  • Is it possible to achieve this behavior by operating directly on the file system (NTFS, for example) - say, update the corresponding inode data and change the file's starting sector so that the first line is ignored? If yes, would this approach be really fragile, or are there other applications, besides the OS itself, that do something similar?

Yippie-Ki-Yay

5 Answers

13

By default on most volumes (but importantly not all!), NTFS stores data in 4096-byte clusters. These clusters are referenced by the file's $MFT record, which you cannot edit directly because the operating system disallows it (for reasons of sanity). As a result, there is no filesystem-level trick that does anything approaching what you want; in other words, you cannot directly reverse-truncate a file on NTFS, even in cluster-sized amounts.

Because of the way files are stored in a filesystem, the only answer is that you must rewrite the entire file (or figure out a different way to store your data). A 2-3 GB file is massive and crazy, especially considering that you referred to lines, which means this data is at least partly text.

Perhaps you should look into putting this data into a database, or at the very least organizing it a bit more efficiently.
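
For reference, here is a minimal sketch of that full rewrite (my sketch, not a prescription: the 1 MB buffer, the .tmp name, and the remove/rename step at the end are all arbitrary choices):

#include <cstdio>
#include <fstream>
#include <string>
#include <vector>

// Copy everything after the first line into a temporary file, then swap it in.
// The whole file is rewritten once; nothing cheaper is possible at this level.
void rewrite_without_first_line(const std::string &path) {
    std::ifstream in(path, std::ios::binary);
    std::ofstream out(path + ".tmp", std::ios::binary);

    std::string firstLine;
    std::getline(in, firstLine);                  // consume and discard line 1

    std::vector<char> buf(1 << 20);               // stream the rest in 1 MB chunks
    while (in) {
        in.read(buf.data(), static_cast<std::streamsize>(buf.size()));
        out.write(buf.data(), in.gcount());
    }

    in.close();
    out.close();
    std::remove(path.c_str());                    // replace the original file
    std::rename((path + ".tmp").c_str(), path.c_str());
}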

OmnipotentEntity
    [You could use a sparse file.](http://blogs.msdn.com/b/oldnewthing/archive/2010/12/01/10097859.aspx) – Joey Jul 19 '12 at 21:39
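
(For completeness: a rough sketch of the sparse-file trick Joey links above, as I understand it - Windows-only, error handling omitted. Marking the file sparse and zeroing the leading range lets NTFS release those clusters, but the file size and offsets are unchanged, so readers still have to skip the zeroed prefix.)

#include <windows.h>
#include <winioctl.h>

// Deallocate the first `bytesToFree` bytes of a file on NTFS. Reads of that
// region afterwards return zeros; only whole clusters are actually released.
bool zero_leading_bytes(const wchar_t *path, LONGLONG bytesToFree) {
    HANDLE h = CreateFileW(path, GENERIC_READ | GENERIC_WRITE, 0, nullptr,
                           OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, nullptr);
    if (h == INVALID_HANDLE_VALUE) return false;

    DWORD returned = 0;
    BOOL ok = DeviceIoControl(h, FSCTL_SET_SPARSE, nullptr, 0,
                              nullptr, 0, &returned, nullptr);   // mark sparse

    FILE_ZERO_DATA_INFORMATION range;
    range.FileOffset.QuadPart = 0;                 // from the start of the file
    range.BeyondFinalZero.QuadPart = bytesToFree;  // up to, not including, this offset
    ok = ok && DeviceIoControl(h, FSCTL_SET_ZERO_DATA, &range, sizeof range,
                               nullptr, 0, &returned, nullptr);

    CloseHandle(h);
    return ok != FALSE;
}
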
8

You can overwrite every character that you want to erase with '\x7f'. Then, when reading in the file, your reader ignores that character. This assumes you have a text file that doesn't ever use the DEL character, of course.

#include <cstddef>
#include <istream>
#include <string>

// getline() variant that strips runs of the chosen DEL character from each
// line, so bytes overwritten with '\x7f' are invisible to the caller.
std::istream &
my_getline (std::istream &in, std::string &s,
            char del = '\x7f', char delim = '\n') {
    std::getline(in, s, delim);
    std::size_t beg = s.find(del);
    while (beg != s.npos) {
        std::size_t end = s.find_first_not_of(del, beg+1);
        s.erase(beg, end-beg);           // drop the whole run of DEL bytes
        beg = s.find(del, beg+1);
    }
    return in;
}
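
The writing side is not shown above, but a minimal sketch might look like this (my assumption: lines are '\n'-terminated; the first line and its newline are overwritten in place, so my_getline effectively starts at line 2):

#include <fstream>
#include <string>

// Mark the first line of `path` as deleted by overwriting it, newline
// included, with DEL bytes. Nothing else in the file is touched.
void delete_first_line(const std::string &path) {
    std::fstream f(path, std::ios::in | std::ios::out | std::ios::binary);
    std::string first;
    if (!std::getline(f, first)) return;          // find out how long line 1 is
    std::string del(first.size() + 1, '\x7f');    // +1 covers the '\n' itself
    f.clear();                                    // getline may have set eofbit
    f.seekp(0);
    f.write(del.data(), static_cast<std::streamsize>(del.size()));
}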

As Henk points out, you could choose a different character to act as your DELETE marker. But the advantage is that the technique works no matter which line you want to remove (it is not limited to the first line), and it doesn't require futzing with the file system.

Using the modified reader, you can periodically "defragment" the file. Or, the defragmentation may occur naturally as the contents are streamed/merged into a different file or archived to a different machine.

Edit: You don't explicitly say it, but I am guessing this is for some kind of logging application, where the goal is to put an upper bound on the size of the log file. However, if that is the goal, it is much easier to just use a collection of smaller log files. Say you maintained roughly 10 MB log files, with total logs bounded to 4 GB; that would be about 400 files. When the 401st file is started, for each line written to it you could apply the DELETE marker to successive lines in the first file. When all of its lines have been marked for deletion, the file itself can be deleted, leaving you with about 400 files again. There is no hidden O(n²) behavior so long as the first file is not closed while the lines are being deleted.

But easier still is to let your logging system keep both the 1st and the 401st files as they are, and simply remove the 1st file when moving on to the 402nd file.
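
A rough sketch of that simpler scheme (the segment naming and the size/count limits are illustrative, not anything you specified):

#include <cstddef>
#include <cstdint>
#include <deque>
#include <filesystem>
#include <fstream>
#include <string>

// Append to numbered segments; once the segment count exceeds maxFiles,
// delete the oldest whole segment instead of trimming lines from it.
class RotatingLog {
public:
    RotatingLog(std::string prefix, std::uintmax_t maxBytes, std::size_t maxFiles)
        : prefix_(std::move(prefix)), maxBytes_(maxBytes), maxFiles_(maxFiles) {
        openNext();
    }

    void writeLine(const std::string &line) {
        if (std::filesystem::exists(current_) &&
            std::filesystem::file_size(current_) >= maxBytes_)
            openNext();
        std::ofstream(current_, std::ios::app) << line << '\n';
    }

private:
    void openNext() {
        current_ = prefix_ + "." + std::to_string(next_++);
        segments_.push_back(current_);
        if (segments_.size() > maxFiles_) {        // drop the oldest segment
            std::filesystem::remove(segments_.front());
            segments_.pop_front();
        }
    }

    std::string prefix_, current_;
    std::uintmax_t maxBytes_;
    std::size_t maxFiles_, next_ = 0;
    std::deque<std::string> segments_;
};

Constructed as, say, RotatingLog log("app.log", 10 * 1024 * 1024, 400);, this keeps roughly 4 GB of history without ever rewriting a file in place.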

jxh
    Yes, clever idea. Alternatively overwrite with spaces, newlines or `\0`s. It all depends on the reader though, and how much it can be adapted. – H H Jul 19 '12 at 20:00
  • @HenkHolterman: You're right. I updated the post to reflect that a different character could be chosen. Regards – jxh Jul 19 '12 at 20:33
6

Even if you could remove a leading block, it would be at least one sector (512 bytes), which probably wouldn't match the size of your line.

Consider a wrapper (maybe even a helper file) to just start reading from a certain offset.
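
For example, a sketch of the helper-file variant (the sidecar name and its plain-text format are assumptions of mine): the helper file stores how many bytes of the data file to skip, "removing" a line just advances that number, and readers seek past it before consuming anything.

#include <cstdint>
#include <fstream>
#include <string>

// The helper file (e.g. "data.log.offset") stores how many bytes of the data
// file are considered deleted. No data is ever moved or rewritten.
std::uint64_t read_offset(const std::string &helper) {
    std::ifstream h(helper);
    std::uint64_t off = 0;
    h >> off;                            // a missing helper file simply means 0
    return off;
}

// "Remove" the first remaining line by advancing the stored offset past it.
void skip_first_line(const std::string &data, const std::string &helper) {
    std::uint64_t off = read_offset(helper);
    std::ifstream in(data, std::ios::binary);
    in.seekg(static_cast<std::streamoff>(off));
    std::string line;
    if (std::getline(in, line))
        std::ofstream(helper, std::ios::trunc) << off + line.size() + 1;
}

// Readers open the data file and jump straight past the "removed" prefix.
std::ifstream open_skipped(const std::string &data, const std::string &helper) {
    std::ifstream in(data, std::ios::binary);
    in.seekg(static_cast<std::streamoff>(read_offset(helper)));
    return in;
}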

H H
3

Idea (no magic dust, only hard work below):

Use a user-mode file system such as http://www.eldos.com/cbfs/ or http://dokan-dev.net/en/ to WRAP around your real filesystem, and create a small book-keeping system to track how much of the file has been 'eaten' at the front. At some point, when the file grows too big, rewrite it into another file and start over.

How about that?

EDIT:

If you go with a virtual file system, then you can use smaller (256 MB) file fragments and glue them into one 'virtual' file at the desired offset. That way you won't ever need to rewrite the file.

MORE:

Reflecting on the idea of 'overwriting' the first few lines with 'nothing': don't do that. Instead, add one 64-bit integer to the FRONT of the file and use any method you like to skip that many bytes, for example a Stream derivation that wraps the original stream and offsets all reads into it.

I guess that might be better if you choose to use wrappers on the 'client' side.
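
A minimal sketch of that header idea, as I picture it (C++ here rather than a .NET Stream wrapper; the fixed 8-byte host-endian header is an assumption):

#include <cstdint>
#include <fstream>
#include <string>

constexpr std::streamoff kHeader = sizeof(std::uint64_t);  // skip count lives here

// Read how many payload bytes are currently "deleted" from the front.
std::uint64_t read_skip(std::fstream &f) {
    std::uint64_t skip = 0;
    f.seekg(0);
    f.read(reinterpret_cast<char *>(&skip), sizeof skip);
    return skip;
}

// Logically drop the first remaining line by bumping the header in place.
void drop_first_line(const std::string &path) {
    std::fstream f(path, std::ios::in | std::ios::out | std::ios::binary);
    std::uint64_t skip = read_skip(f);
    f.seekg(kHeader + static_cast<std::streamoff>(skip));
    std::string line;
    if (std::getline(f, line)) {
        skip += line.size() + 1;                     // the '\n' goes with it
        f.seekp(0);
        f.write(reinterpret_cast<const char *>(&skip), sizeof skip);
    }
}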

Daniel Mošmondor
0

Break the file in two, the first part being the smaller chunk. Remove the first line from it, then concatenate it with the other part.

pckabeer