
I want to read and remove the first line of a text file without copying the rest into a new file (it's a huge file).
Everything I've found online just copies the desired content to a new file. I can't do that.

Below is a first attempt. This code gets stuck in an infinite loop because no line is ever removed. If it removed the first line of the file on each iteration, it would eventually reach the end of the file and stop.

#include <iostream>
#include <string>
#include <fstream>
#include <boost/interprocess/sync/file_lock.hpp>

int main() {
    std::string line;
    std::fstream file;
    boost::interprocess::file_lock lock("test.lock");
    while (true) {
        std::cout << "locking\n";
        lock.lock();
        file.open("test.txt", std::fstream::in|std::fstream::out);
        if (!file.is_open()) {
            std::cout << "can't open file\n";
            file.close();
            lock.unlock();
            break;
        }
        else if (!std::getline(file,line)) {
            std::cout << "empty file\n"; //
            file.close();                // never
            lock.unlock();               // reached
            break;                       //
        }
        else {
            // remove first line
            file.close();
            lock.unlock();
            // do something with line
        }
    }
}
user1587451
  • Files just don't work that way (similar to a raw array, you can't remove the first element without moving all the remaining elements up one slot). – crashmstr Mar 01 '16 at 13:12
  • Every branch of that `if` statement has `file.close(); lock.unlock();`. The destructor for the `std::fstream` object will close the file, so you don't need to explicitly close it (and when `file.is_open()` returns false, there's no need to close it). And there is undoubtedly an RAII type in Boost for managing that lock, with a destructor to unlock it. – Pete Becker Mar 01 '16 at 13:13
  • Yeah sure, "everybody"... https://www.google.com/search?q=c%2B%2B+modify+file+in+place – Christian Hackl Mar 01 '16 at 13:13
  • I think this question is related to truncating the file at the front. One option is to move the data after the first line to the beginning of the file. But this will be costly for bigger files. Please also check this: http://stackoverflow.com/questions/706167/truncate-file-at-front – Umamahesh P Mar 01 '16 at 13:27
  • @user1587451 - How big is the file and what's the target OS, is it windows? The solution that springs immediately to mind would be to memory-map the file then simply use memmove or memcpy to shift the bytes back by the length of the first line. While this is still copying in the strictest sense, you'd be leveraging the OS to do it, which would take care of pretty much all of the heavy lifting. This can trivially be done if the file is under 4GB using C and the WindowsAPI. – enhzflep Mar 01 '16 at 15:03

2 Answers


Here's a solution written in C for Windows. It finishes on a 700,000-line, 245MB file in about 0.14 seconds.

Basically, I memory-map the file so that I can access its contents with the functions used for raw memory access. Once the file has been mapped, I search for the first of the pair of characters that denote an EOL in Windows (\r and \n), which tells us how long the first line is in bytes.

From there, I copy everything from the first byte of the second line back to the start of the memory-mapped area (basically, the first byte in the file). Because the source and destination regions overlap, the copy has to be one that tolerates overlap.

Once this is done, the view is unmapped, the handle to the file mapping is closed, and the SetEndOfFile function is used to reduce the length of the file by the length of the first line plus its EOL sequence. When we close the file, it has shrunk by this length and the first line is gone.

The file was already in memory because I had just created and written it, which obviously affects the execution time somewhat. The Windows caching mechanism is the 'culprit' here, and it is the very same mechanism we're leveraging to make the operation complete so quickly.

The test data is the source of the program duplicated 100,000 times and saved as testInput2.txt (paste it 10 times, select all, copy, paste 10 times, replacing the original 10, for a total of 100 times; repeat until the output is big enough. I stopped here because more seemed to make Notepad++ a 'bit' unhappy).

Error checking in this program is virtually non-existent, and the input is expected not to be Unicode, i.e. the input is 1 byte per character. The EOL sequence is 0x0D, 0x0A (\r, \n).

Code:

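#include <stdio.h>
#include <string.h>
#include <windows.h>

void testFunc(const char inputFilename[] )
{
    DWORD bytesToRemove = 0;    // length of the first line including its \r\n

    // The filename is a narrow (char) string, so call the ANSI variant explicitly
    HANDLE fileHandle = CreateFileA(
                                    inputFilename,
                                    GENERIC_READ | GENERIC_WRITE,
                                    0,
                                    NULL,
                                    OPEN_EXISTING,
                                    FILE_ATTRIBUTE_NORMAL | FILE_FLAG_WRITE_THROUGH,
                                    NULL
                                    );

    if (fileHandle != INVALID_HANDLE_VALUE)
    {
        printf("File opened okay\n");

        // Only fileSizeLo is used below, so this handles files under 4GB only
        DWORD fileSizeHi, fileSizeLo = GetFileSize(fileHandle, &fileSizeHi);

        HANDLE memMappedHandle = CreateFileMapping(
                                                    fileHandle,
                                                    NULL,
                                                    PAGE_READWRITE | SEC_COMMIT,
                                                    0,
                                                    0,
                                                    NULL
                                                );
        if (memMappedHandle)
        {
            printf("File mapping success\n");
            LPVOID memPtr = MapViewOfFile(
                                            memMappedHandle,
                                            FILE_MAP_ALL_ACCESS,
                                            0,
                                            0,
                                            0
                                          );
            if (memPtr != NULL)
            {
                printf("View of file successfully created\n");
                printf("File size is: 0x%08lX%08lX\n", (unsigned long)fileSizeHi, (unsigned long)fileSizeLo);

                // The mapped view is not NUL-terminated, so use a bounded search
                char *start = (char*)memPtr;
                char *eolPos = (char*)memchr(start, '\r', fileSizeLo);    // windows EOL sequence is \r\n
                if (eolPos != NULL)
                {
                    DWORD lineLength = (DWORD)(eolPos - start);
                    printf("Length of first line is: %lu\n", (unsigned long)lineLength);

                    bytesToRemove = lineLength + 2;
                    if (bytesToRemove > fileSizeLo)
                        bytesToRemove = fileSizeLo;     // \r was the very last byte

                    // Source and destination overlap, so memmove is required (not memcpy)
                    memmove(start, start + bytesToRemove, fileSizeLo - bytesToRemove);
                }
                UnmapViewOfFile(memPtr);
            }

            CloseHandle(memMappedHandle);
        }

        // Shrink the file only if a first line was actually found and shifted out
        if (bytesToRemove > 0)
        {
            SetFilePointer(fileHandle, -(LONG)bytesToRemove, NULL, FILE_END);
            SetEndOfFile(fileHandle);
        }
        CloseHandle(fileHandle);
    }
}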

int main()
{
    const char inputFilename[] = "testInput2.txt";
    testFunc(inputFilename);
    return 0;
}
enhzflep

What you want to do, indeed, is not easy.

If you open the same file for both reading and writing without being careful, you will end up reading what you just wrote, and the result will not be what you want.

Modifying the file in place is doable: just open it, seek in it, modify and close. However, you want to shift all the content of the file, except the first K bytes, toward the beginning. That means you will have to iteratively read and write the whole file in chunks of N bytes.

Once that is done, K stale bytes will remain at the end and need to be removed. I don't think there's a way to do that with streams. You can use the ftruncate or truncate functions from unistd.h, or use Boost.Interprocess truncate for this.

Here is an example (without any error checking; I'll leave that to you):

#include <algorithm>
#include <fstream>
#include <string>
#include <vector>
#include <unistd.h>

int main()
{
  std::fstream file;
  // Binary mode so that stream offsets match byte offsets in the file
  file.open("test.txt", std::fstream::in | std::fstream::out | std::fstream::binary);

  // First retrieve size of the file
  file.seekg(0, file.end);
  std::streamoff endPos = file.tellg();
  file.seekg(0, file.beg);

  // Then read the first line; the read position now sits just past its '\n'
  std::string firstLine;
  std::getline(file, firstLine);

  // We need two positions: the read one and the write one.
  // getline sets eofbit when the line has no trailing '\n'; the line then spans the whole file.
  std::streamoff readPos = file.eof() ? endPos : static_cast<std::streamoff>(file.tellg());
  std::streamoff writePos = 0;
  file.clear();

  // Copy the rest of the file toward the beginning, one chunk at a time
  std::vector<char> buffer(256);
  while (readPos < endPos)
  {
    // The last chunk may be shorter than the buffer
    std::streamsize chunk = std::min<std::streamoff>(buffer.size(), endPos - readPos);
    file.seekg(readPos);
    file.read(buffer.data(), chunk);
    file.seekp(writePos);
    file.write(buffer.data(), chunk);
    readPos += chunk;
    writePos += chunk;
  }
  file.close();

  // No clean way to truncate streams, use function from unistd.h
  truncate("test.txt", writePos);
  return 0;
}

I'd really like to be able to provide a cleaner solution for in-place modification of the file, but I'm not sure there's one.

Colin Pitrat
  • Would it be easier to just read and remove the last line? Such a solution would be sufficient for me too. – user1587451 Mar 01 '16 at 14:22
  • With streams, AFAIK, unless you open with the truncate flag, you cannot reduce the size of the file, you can only increase it (but I may be wrong). – Colin Pitrat Mar 01 '16 at 14:24