
I'm writing a program that reads a large number of text files, searches each line for regular expression matches, and saves the line text and line number, along with the file name and folder path, to a .csv file. The method I'm using is as follows:


    string line;
    ifstream stream1(filePath);
    while (getline(stream1, line))
    {
        // Code here that compares the regular search expression to the line.
        // If it matches, save the data to a tuple for later writing to the .csv file.
    }
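
To make that concrete, here's a minimal sketch of the per-file pass I'm describing (names like MatchRecord and searchFile are just placeholders):

    #include <fstream>
    #include <regex>
    #include <string>
    #include <tuple>
    #include <vector>

    // Placeholder record: folder path, file name, line number, line text
    using MatchRecord = std::tuple<std::string, std::string, size_t, std::string>;

    void searchFile(const std::string& folder, const std::string& fileName,
                    const std::regex& pattern, std::vector<MatchRecord>& out)
    {
        std::ifstream stream1(folder + "\\" + fileName);
        std::string line;
        size_t lineNumber = 0;
        while (std::getline(stream1, line))
        {
            lineNumber++;
            if (std::regex_search(line, pattern))
                out.emplace_back(folder, fileName, lineNumber, line);
        }
    }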

I'm wondering if there is a faster way to do this. I wrote the same type of program in Matlab (which I'm more experienced in) using the same line-by-line logic, and got the run time down to roughly 5.5 minutes for 300 MB of data (I'm not even sure whether that's fast, probably not), but the C++ version built in Visual Studio takes as long as 2 hours on the same data.

I had heard how fast C++ can be at reading and writing data, so I'm a little confused by these results. Is there a faster method? All I found when searching online was memory mapping, which seemed to be Linux/Unix-only.

  • Possible duplicate of [Fast textfile reading in c++](https://stackoverflow.com/questions/17925051/fast-textfile-reading-in-c) – vik_78 Apr 16 '19 at 12:05
  • What if the search pattern is split across multiple lines? – stark Apr 16 '19 at 15:00

1 Answer


You can use memory-mapped files.

Since you’re on Windows, the right API is probably the CAtlFileMapping<char> template class from ATL. Here's an example.

    #include <atlfile.h>

    // Error-checking macro: bail out on any failed HRESULT
    #define CHECK( hr ) { const HRESULT hr_ = ( hr ); if( FAILED( hr_ ) ) return hr_; }

    HRESULT testMapping( const wchar_t* path )
    {
        // Open the file for reading
        CAtlFile file;
        CHECK( file.Create( path, GENERIC_READ, FILE_SHARE_READ, OPEN_EXISTING ) );
        // Map the entire file into the address space
        CAtlFileMapping<char> mapping;
        CHECK( mapping.MapFile( file ) );
        // Query file size
        ULONGLONG ullSize;
        CHECK( file.GetSize( ullSize ) );

        const char* const ptrBegin = mapping;
        const size_t length = (size_t)ullSize;
        // Process the mapped data, e.g. call memchr() to find your new lines

        return S_OK;
    }
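
For illustration, here's a minimal sketch of that processing step, assuming the ptrBegin/length values from above; the onLine callback is a placeholder, and note the line text is not null-terminated:

    #include <string.h>

    // Invoke onLine( lineNumber, lineStart, lineLength ) for every line in the buffer
    template<typename Callback>
    void forEachLine( const char* ptrBegin, size_t length, Callback onLine )
    {
        const char* p = ptrBegin;
        const char* const end = ptrBegin + length;
        size_t lineNumber = 1;
        while( p < end )
        {
            const char* nl = (const char*)memchr( p, '\n', end - p );
            const char* const lineEnd = nl ? nl : end;
            size_t len = (size_t)( lineEnd - p );
            // Strip the trailing '\r' of Windows CRLF line endings
            if( len > 0 && lineEnd[ -1 ] == '\r' )
                len--;
            onLine( lineNumber, p, len );
            lineNumber++;
            p = nl ? nl + 1 : end;
        }
    }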

Don’t forget that the address space of a 32-bit process is limited (at most 2–4 GB, and much less of it is available as one contiguous range), so compiling a 64-bit program makes a lot of sense for this application.
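
If you want to enforce that, a simple compile-time check works (just an illustrative assertion):

    // Fail the build if this translation unit is not compiled as 64-bit
    static_assert( sizeof( void* ) == 8, "Compile this program as 64-bit" );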

Also, if your files are very small but you have a huge count of them, and they are stored on a fast SSD, a better approach is to process multiple files in parallel. It’s somewhat harder to implement, though.
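
For instance, a minimal sketch using C++17 parallel algorithms, assuming the testMapping() function above; error handling is omitted, and any shared output (such as the collected CSV rows) would need synchronization, e.g. a mutex or per-thread buffers merged at the end:

    #include <algorithm>
    #include <execution>
    #include <string>
    #include <vector>

    // Map and scan many files concurrently; testMapping() is defined above
    void processAll( const std::vector<std::wstring>& paths )
    {
        std::for_each( std::execution::par, paths.begin(), paths.end(),
            []( const std::wstring& path ) { testMapping( path.c_str() ); } );
    }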

Soonts