-2

I have some large text files (>1 GB) and need to replace or delete some lines. I need to be able to replace one randomly chosen line with another, delete it, or insert one line after another.

I have tried counting the number of lines using `getline(file, line)`, which takes too much time. For the same reason it also takes a long time to reach a line by its number.
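
For reference, the slow counting approach was presumably something like the following sketch (a reconstruction, not the asker's actual code):

#include <fstream>
#include <string>

// Counts lines by streaming the whole file through getline.
// Every byte is read and copied into `line`, which is what makes
// this slow on a multi-gigabyte file.
long long CountLines(const char *path)
{
    std::ifstream file(path);
    std::string line;
    long long count = 0;
    while (std::getline(file, line))
        ++count;
    return count;
}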

Is there a more efficient or better way to do this?

kaz
gravicapa
  • In such a specialized case, I wouldn't trust library functions. Read character by character and use memory as needed. You surely cannot hold the whole file (multiple GB) in memory, so you need tricks that depend on your problem specification. Tell us more about your objective and what you have done so far. – Halil Jun 05 '15 at 19:47
  • Is the number of characters per line guaranteed to be the same? If so, then it becomes much easier to find a particular line without using `getline`. The issue would be when it comes time to put the file back together with the missing lines. I don't think you can do that in a quick manner. – PaulMcKenzie Jun 05 '15 at 19:53
  • The number of characters per line is not the same; it can be any arbitrary file. The number of lines is not known either. So the main problem is to exchange two random lines or to delete one line from the file. I haven't done much so far, only tried to reach the last line and count the total number of lines using `getline`. – gravicapa Jun 05 '15 at 20:04
  • Well, you can use a memory-mapped file as the answer suggested. The other option, if you can't use a memory-mapped file, is to go hard-core. Read the text file in binary mode, maybe a few megabytes at a time. In each megabyte, search for the CR/LF and keep a count until you reach the line. Use the C-style string search functions such as `strchr`. My personal experience: doing things this way, at least with the hardware I was working with, found a line in a multi-gigabyte text file in mere seconds, as opposed to calling `getline` repeatedly. Again, YMMV depending on the hardware, disk access, etc. – PaulMcKenzie Jun 05 '15 at 20:15
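
A minimal sketch of the chunked scan described in the last comment (the function name, chunk size, and 0-based line numbering are assumptions; it also assumes Unix '\n' line endings):

#include <cstdio>
#include <cstring>
#include <vector>

// Returns the byte offset where the 0-based line `target` starts,
// or -1 if the file ends first.
long long FindLineOffset(const char *path, long long target)
{
    std::FILE *f = std::fopen(path, "rb");
    if (!f) return -1;

    const size_t kChunk = 4 * 1024 * 1024; // a few MB per read
    std::vector<char> buf(kChunk);
    long long line = 0, offset = 0;

    while (line < target) {
        size_t got = std::fread(buf.data(), 1, kChunk, f);
        if (got == 0) { std::fclose(f); return -1; } // hit EOF first
        const char *p = buf.data(), *end = p + got;
        while (p < end && line < target) {
            const char *nl = static_cast<const char *>(
                std::memchr(p, '\n', end - p));
            if (!nl) break; // the rest of this line is in the next chunk
            ++line;
            p = nl + 1;
        }
        // Consume the whole chunk unless the target line was reached.
        offset += (line < target) ? (long long)got : (p - buf.data());
    }
    std::fclose(f);
    return offset;
}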

4 Answers

2

You have to take into account that in order to change one line you need to read the whole file, search for the line, change it, and then write the whole file back. That is bad, yes, but files are sequential in nature, and you can't get to a place in one without visiting everything before it.
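
A minimal sketch of that read-modify-rewrite cycle (the helper name and the temp-file scheme are assumptions, not part of this answer):

#include <cstdio>
#include <fstream>
#include <string>

// Replaces the 0-based line `target` by streaming the file to a
// temporary copy, then swapping the copy in.
bool ReplaceLine(const std::string &path, long long target,
                 const std::string &replacement)
{
    std::ifstream in(path);
    std::ofstream out(path + ".tmp");
    if (!in || !out) return false;

    std::string line;
    long long n = 0;
    while (std::getline(in, line)) {
        out << (n == target ? replacement : line) << '\n';
        ++n;
    }
    in.close();
    out.close();
    std::remove(path.c_str()); // drop the original...
    return std::rename((path + ".tmp").c_str(), path.c_str()) == 0; // ...and swap
}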

Omar Zaarour
2

You basically have two options, depending on your actual problem:

1) If you perform this operation repeatedly on your files, you can somewhat optimize it by using more advanced data structures. Basically, you are no longer storing flat text files but collections of lines. This could be achieved by adding a header with offsets for each line; by an additional delta file that contains all changes (and has to, of course, be consulted while reading) and is only applied when it starts to grow too big or your operation is finished; or even by keeping all the lines in a more traditional DBMS. A sketch of the offset-index variant follows the comments below.

2) If this operation is only performed rarely per file, you might wish to optimize your read routine a bit. You probably have the best chance by mmapping the whole file and scanning it for EOL yourself, as you can get rid of a whole bunch of memory allocations and string copies this way. While mmap obviously causes memory pressure in the background, I have found this technique to be fairly fast in practice and very easy to implement.
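
A minimal sketch of option 2 on POSIX (an assumption about the intended approach; the last answer below shows the equivalent Windows API calls):

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <cstdio>
#include <cstring>

int main(int argc, char **argv)
{
    if (argc < 2) { std::fprintf(stderr, "usage: %s file\n", argv[0]); return 1; }

    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) { std::perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) != 0 || st.st_size == 0) { close(fd); return 1; }

    // One mapping, zero per-line allocations or string copies.
    char *data = static_cast<char *>(
        mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0));
    if (data == MAP_FAILED) { std::perror("mmap"); return 1; }

    // Scan for EOL ourselves; memchr is typically vectorized.
    long long lines = 0;
    for (char *p = data, *end = data + st.st_size; p < end;) {
        char *nl = static_cast<char *>(std::memchr(p, '\n', end - p));
        if (!nl) { ++lines; break; } // last line without trailing '\n'
        ++lines;
        p = nl + 1;
    }
    std::printf("%lld lines\n", lines);

    munmap(data, st.st_size);
    close(fd);
    return 0;
}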

danielschemmel
  • The link for `mmap` is for Linux; does it exist for Windows? – gravicapa Jun 05 '15 at 20:10
  • I think you should search before you ask. http://stackoverflow.com/questions/4087282/is-there-a-memory-mapping-api-on-windows-platform-just-like-mmap-on-linux – Halil Jun 05 '15 at 20:18
  • for option 1, pair the text file with a separate index file. Your index file has a header, then fixed-length records describing each line (offset & length). The first time you work on the file you still have to parse each text line (using the memory-mapped file), but you write the index file as you do that, so you never have to parse the lines again. Keep both in sync if you do an operation, of course. This is common with a blob file containing multiple small binary files (graphics) but would work with text lines. – Kenny Ostrom Jun 05 '15 at 20:50
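
A hedged sketch of the paired-index idea from option 1 and the comment above (the record layout, helper names, and the assumption of Unix '\n' line endings are illustrative, not from the original):

#include <cstdint>
#include <fstream>
#include <string>

// Fixed-size record: where line n starts and how long it is.
struct LineRecord { std::uint64_t offset; std::uint64_t length; };

// One-time pass that pairs the text file with an index file.
void BuildIndex(const std::string &textPath, const std::string &idxPath)
{
    std::ifstream in(textPath, std::ios::binary);
    std::ofstream idx(idxPath, std::ios::binary);
    std::string line;
    std::uint64_t offset = 0;
    while (std::getline(in, line)) {
        LineRecord rec{offset, line.size()};
        idx.write(reinterpret_cast<const char *>(&rec), sizeof rec);
        offset += line.size() + 1; // +1 for the '\n' getline consumed
    }
}

// Random access afterwards costs two seeks instead of a full scan.
std::string ReadLine(const std::string &textPath,
                     const std::string &idxPath, std::uint64_t n)
{
    std::ifstream idx(idxPath, std::ios::binary);
    idx.seekg(static_cast<std::streamoff>(n * sizeof(LineRecord)));
    LineRecord rec;
    idx.read(reinterpret_cast<char *>(&rec), sizeof rec);

    std::ifstream in(textPath, std::ios::binary);
    in.seekg(static_cast<std::streamoff>(rec.offset));
    std::string text(rec.length, '\0');
    in.read(&text[0], static_cast<std::streamsize>(rec.length));
    return text;
}
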
0

I think you could use a database to store the lines. Each row of your table could have two columns (ID and line). After filling the database you could ask it for the line with a random ID.
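
A minimal sketch of this idea using SQLite's C API (the schema, file name, and line ID are assumptions; compile with -lsqlite3):

#include <sqlite3.h>

int main()
{
    sqlite3 *db = nullptr;
    if (sqlite3_open("lines.db", &db) != SQLITE_OK) return 1;

    sqlite3_exec(db,
        "CREATE TABLE IF NOT EXISTS lines (id INTEGER PRIMARY KEY, txt TEXT);",
        nullptr, nullptr, nullptr);

    // Replacing one line becomes a single indexed UPDATE instead of
    // a rewrite of the whole file.
    sqlite3_stmt *stmt = nullptr;
    sqlite3_prepare_v2(db, "UPDATE lines SET txt = ?1 WHERE id = ?2;",
                       -1, &stmt, nullptr);
    sqlite3_bind_text(stmt, 1, "replacement line", -1, SQLITE_TRANSIENT);
    sqlite3_bind_int64(stmt, 2, 42); // some randomly chosen line ID
    sqlite3_step(stmt);
    sqlite3_finalize(stmt);

    sqlite3_close(db);
    return 0;
}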

  • I don't think duplicating multiple GB of data into a DB is a good approach for this task. Then he would have to maintain everything in the DB, not in files. – Halil Jun 05 '15 at 19:50
  • For this reason I think it is better to maintain everything in the DB. He could easily change each line that he wants to. Of course he must delete the text file and keep only the DB. – Dimitris Dimitriadis Jun 06 '15 at 09:59
0

I believe the best option in your case is to use memory-mapped files. CreateFileMapping and MapViewOfFile will help you with this.

Below is a simple example of working with memory-mapped files on Windows. I omitted the error-checking code and simplified the task to replacing the chosen line's characters with '?'. This was done intentionally, to outline the main aspects only:

1) Create a memory-mapped file (CreateFile -> CreateFileMapping -> MapViewOfFile).

2) Process it as if it were just a memory block. Yes, you would have to scan it from the beginning if you need to process it in line mode, but mmapping it is, I believe, the fastest approach.

Important note: the real-life task may require the original file to shrink or grow. Shrinking is quite easy: just memmove the respective tail portion back and truncate the file on close (a sketch of this follows the example below). Growing would require trickier techniques.

3) Don't forget to release the resources with UnmapViewOfFile/CloseHandle.

#include "stdafx.h"
#include <Windows.h>

int _tmain(int argc, _TCHAR* argv[])
{
    TCHAR fileName[] = _T("SomeHugeFile.dat");
    HANDLE hFile = CreateFile(
        fileName,
        GENERIC_READ | GENERIC_WRITE,
        0,
        NULL,
        OPEN_EXISTING,
        FILE_ATTRIBUTE_NORMAL,
        NULL
        );
    if (hFile == INVALID_HANDLE_VALUE) { // CreateFile returns INVALID_HANDLE_VALUE, not NULL
        // Process the error
    }
    DWORD fileSize = GetFileSize(hFile, NULL);
    if (fileSize == INVALID_FILE_SIZE || fileSize == 0) {
        // Check the actual error with GetLastError and process
        // accordingly (an empty file cannot be mapped anyway)
    }
    HANDLE hMemMappedFile = CreateFileMapping(
        hFile,
        NULL,
        PAGE_READWRITE,
        0,
        0,
        NULL
        );
    if (!hMemMappedFile) {
        // Process the error
    }
    LPVOID mappedMemory = MapViewOfFile(
        hMemMappedFile,
        FILE_MAP_ALL_ACCESS,
        0,
        0,
        0   // To the end of the file
        );
    if (!mappedMemory) {
        // Process the error
    }

    DWORD lineToDelete = 3; // Some random line number (0-based; this simplified scan cannot select line 0)
    // Assuming the file is ASCII and with Unix line endings
    char *mappedMemoryStart = (char *)mappedMemory;
    char *mappedMemoryEnd = mappedMemoryStart + fileSize;
    char *lineStart = NULL;
    char *lineEnd = NULL;

    // Find the line start:
    DWORD lineNumber = 0;
    for (lineStart = (char *)mappedMemory; lineStart < mappedMemoryEnd; lineStart++) {
        if (*lineStart == '\n') {
            lineNumber++;
            if (lineNumber == lineToDelete) {
                lineStart++;
                break;
            }
        }
    }
    if (lineStart >= mappedMemoryEnd) {
        // Error: has gone beyond file end
    }
    for (lineEnd = lineStart; lineEnd < mappedMemoryEnd; lineEnd++) {
        if (*lineEnd == '\n') {
            break;
        }
    }
    // Now mangle the found line:
    while (lineStart < lineEnd) {
        *lineStart = '?';
        lineStart++;
    }

    UnmapViewOfFile(mappedMemory);
    CloseHandle(hMemMappedFile);
    CloseHandle(hFile);
    return 0;
}
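
A hedged sketch of the shrink step mentioned in the note above (the helper and its parameters mirror the example's variables but are an assumption, not part of the original answer):

#include <Windows.h>
#include <cstring>

// Removes the bytes [lineStart, lineEnd] (the line plus its '\n')
// from a file currently mapped at `base` with size `fileSize`.
void DeleteMappedLine(HANDLE hFile, HANDLE hMapping, char *base,
                      DWORD fileSize, char *lineStart, char *lineEnd)
{
    // Pull the tail of the file down over the deleted line.
    size_t tailSize = (base + fileSize) - (lineEnd + 1);
    memmove(lineStart, lineEnd + 1, tailSize);
    DWORD newSize = fileSize - (DWORD)((lineEnd + 1) - lineStart);

    // The view and mapping must be closed before the file can shrink.
    UnmapViewOfFile(base);
    CloseHandle(hMapping);
    SetFilePointer(hFile, (LONG)newSize, NULL, FILE_BEGIN);
    SetEndOfFile(hFile); // truncate at the new size
    CloseHandle(hFile);
}
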
Dmitry Egorov