6

I have a file of the following format:

1: some_basic_info_in_this_line
2: LOTS_OF_INFO_IN_THIS_LINE_HUNDREDS_OF_CHARS
3: some_basic_info_in_this_line
4: LOTS_OF_INFO_IN_THIS_LINE_HUNDREDS_OF_CHARS
...

That format repeats itself tens of thousands of times, producing files of 50 GiB or more. I need an efficient way to process only line 2 of each four-line record. I'm open to using C, the C++11 STL, or Boost. I've looked at various other questions regarding file streaming on SO, but I feel like my situation is unique because of the large file size and only needing one out of every four lines.

Memory mapping the file seems to be the most efficient approach from what I've read, but mapping a 50+ GiB file will eat up most computers' RAM (you can assume that this application will be used by "average" users - say 4-8 GiB of RAM). Also, I only need to process one of the lines at a time. Here is how I am currently doing this (yes, I'm aware this is not efficient; that's why I'm redesigning it):

std::string GL::getRead(std::ifstream& input)
{
    std::string str;
    std::string toss;
    if (input.good())
    {
        std::getline(input, toss);
        std::getline(input, str);
        std::getline(input, toss);
        std::getline(input, toss);
    }
    return str;
}

Is breaking the mmap into blocks the answer for my situation? Is there any way that I can leverage only needing 1 out of every 4 lines? Thanks for the help.

Iharob Al Asimi
zeus_masta_funk
  • Again, I wonder who downvoted this. It is an interesting enough question and is well posed. +1 – sehe Oct 17 '15 at 18:48
  • If your format is strictly defined with line sizes you can use `input.seekg` to skip the unwanted lines – Nir Oct 17 '15 at 18:50
  • The main bottleneck will be in the input operation itself. I suggest you experiment with various ways to load the data block by block on a typical user's computer. Not sure how to best handle a line that straddles two blocks (there will be such a line for most blocks). – Cheers and hth. - Alf Oct 17 '15 at 19:04
  • It's generally a good idea to check that the input operations succeeded before returning `str` - as is, the caller should check that if any flag is set in `input`, it's only `eof` and not `bad` or `fail`, and also check for `eof` at some stage - only to have `getRead()` check `input.good()` before input; that check should be redundant if the rest is done right. – Tony Delroy Oct 17 '15 at 19:05
  • Like Nir said, you have to check every character to see if it's a newline, unless you can skip some based on rules of your file format. `strchr` or whatever is used internally looking for newlines should easily be able to keep up with disk-read bandwidth, even from a decent SSD. – Peter Cordes Oct 17 '15 at 19:40

3 Answers

5

Use ignore instead of getline:

std::string GL::getRead(ifstream& input)
{
    std::string str;
    if (!input.fail())
    {
        input.ignore(LARGE_NUMBER, '\n');
        getline(input, str);
        input.ignore(LARGE_NUMBER, '\n');
        input.ignore(LARGE_NUMBER, '\n');
    }
    return str;
}

`LARGE_NUMBER` could be `std::numeric_limits<std::streamsize>::max()` if you don't have a good reason to use a smaller number (think of DoS attacks).

TIP Consider passing `str` by reference. By reading into the same string each time, you can avoid a lot of allocations, which are typically the number one reason a program like this runs slowly.

TIP Consider using a memory-mapped file (Boost Iostreams, Boost Interprocess, or mmap(2))

sehe
  • I would check the success of the `getline` call. Not sure that it guarantees an empty string on failure. – Cheers and hth. - Alf Oct 17 '15 at 19:10
  • LARGE_NUMBER __should__ be `std::numeric_limits<std::streamsize>::max()`; it's a special case and tells `ignore` not to count characters, just to look for the delimiter. – Blastfurnace Oct 17 '15 at 19:26
  • @Blastfurnace Fixed (though I would likely use something else here to be safer. You don't want to have unexpected runaway IO) – sehe Oct 17 '15 at 19:27
  • @sehe: If you're going to use a non-infinite max count, it makes more sense to do it for the getline (which will allocate memory to hold the whole line) than for the `.ignore` which is just skipping characters. This is the wrong place to defend against having your program run on `/dev/zero` or something. – Peter Cordes Oct 17 '15 at 19:36
  • @PeterCordes Respectfully disagree. Of course you need more logic. Also note the Tips. – sehe Oct 17 '15 at 19:37
  • @sehe: If you're going to DOS this program, you'd do it with super-long lines that `getline` reads, not with super-long lines that are ignored. skipping over tons of characters until the next newline uses a lot of I/O and CPU time, but filling up memory is what will really bring a computer to its knees. If you have some kind of check that the file is correctly formatted at the start, and some sane limit on getline line length, I think sitting there reading a big file that was fed to the program is sensible behaviour. If you want to read and error out, sure. But using a finite number? – Peter Cordes Oct 17 '15 at 19:41
  • @PeterCordes Please let's stay on topic (DOS doesn't require the "computer [going] to its knees". It's literally: _denial of service_. If the program doesn't finish/takes way too long that is service denied.) – sehe Oct 17 '15 at 19:52
  • @zeus_masta_funk An interesting example of using a memory mapped file for lazily parsing randomly access (variable-length) text lines in a memory mapped file is this [the 2nd approach in this answer](http://stackoverflow.com/a/28220864/85371). – sehe Oct 17 '15 at 19:52
  • @sehe: my point was that working in smaller chunks in the ignore loops doesn't prevent infinite-loop behaviour. I think we both agree the right choice depends on what control you have over the input. e.g. if the beginning of the file looks ok, and it has finite size, then error-checking the rest of the file can be kept fairly simple. If you're running this on potentially-malicious files, you need to do early-out checks along the way for all the lines, esp. the one you read with `getline`. – Peter Cordes Oct 17 '15 at 22:11
2

Memory-mapping a file doesn't load it into RAM. It takes up virtual address space in the process, but not physical RAM. The mmap system call will simply fail on a 32-bit system, because 4 GiB of virtual address space isn't enough for a 50 GiB file. On a 64-bit system, it will take microseconds. (No disk read, because the file is already open, so the file metadata is already loaded.)

Only the pages you actually read from are loaded from disk, and pages can be unmapped again whenever the OS wants to reclaim some memory. (Because if you read them again later, the OS can reload from disk. It's like swapping them out to swap space / pagefile, but without having to write because there's already a clean copy on disk.)

A memory mapping lets your process read the OS's page-cache pages, rather than making copies of them with a read system call.

Have a look at Wikipedia for more info.

Peter Cordes
0

This is the most efficient solution that I could come up with that is platform independent. I thought about all of the users, and for now I am running under the assumption that everyone working with 4+ GiB files has a 64-bit machine. If this changes, I will have to modify the class to map "blocks" of the file into separate mmap regions.

#include <cstring>  // memchr
#include <string>
#include <boost/iostreams/device/mapped_file.hpp>

//////////////////////////////////////////////////////////////////////////////////////////
/// @class LineParser
///
/// @brief Class to efficiently parse a file and take the second line out of every 4 lines
///
/// This class uses memory-mapped io to efficiently extract and return sequences from a
/// file
//////////////////////////////////////////////////////////////////////////////////////////
class LineParser
{
private:
    boost::iostreams::mapped_file mmap; ///< Object for the memory-mapped file
    const char* curr = nullptr;         ///< Current position in the file
    const char* end = nullptr;          ///< End position of the file

public:
    //////////////////////////////////////////////////////////////////////////////////////
    /// @fn valid
    ///
    /// Indicates whether the parser is in a valid state or not
    ///
    /// @return Boolean indicating if the parser is open and in a valid state
    ///
    /// @note Declared inline as it is acceptable in my situation because of being called 
    ///       many times in only a few spots    
    //////////////////////////////////////////////////////////////////////////////////////
    inline bool valid(void)
    {
        return (curr && end && (curr < end) && (mmap.is_open()));
    }

    //////////////////////////////////////////////////////////////////////////////////////
    /// @fn peek
    ///
    /// Obtains the next sequence string - if it exists - but maintains the parser's state
    ///
    /// @return Next sequence available in the file. Empty string returned if none
    ///         exists
    ///
    /// @note Declared inline as it is acceptable in my situation because of being called 
    ///       many times in only a few spots
    //////////////////////////////////////////////////////////////////////////////////////
    inline std::string peek(void)
    {
        const char* save = curr;
        std::string ret;

        getRead(ret);

        curr = save;
        return ret;
    }

    //////////////////////////////////////////////////////////////////////////////////////
    /// @fn getRead
    ///
    /// Sets container to the current read being processed
    ///
    /// @param container String container to place current sequence into
    ///
    /// @return Boolean indicating if getting a new read was successful
    ///
    /// @note Declared inline as it is acceptable in my situation because of being called 
    ///       many times in only a few spots
    //////////////////////////////////////////////////////////////////////////////////////
    inline bool getRead(std::string& container)
    {
        if (!valid())
        {
            return false;
        }

        // Skip line 1; bail out instead of dereferencing a null memchr result
        const char* next = static_cast<const char*>(memchr(curr, '\n', end - curr));
        if (!next) { return false; }
        curr = next + 1;

        // Capture line 2 (the sequence), reusing the container's storage
        const char* index = static_cast<const char*>(memchr(curr, '\n', end - curr));
        if (!index) { return false; }
        container.assign(curr, index - curr);
        curr = index + 1;

        // Skip lines 3 and 4
        for (int i = 0; i < 2; ++i)
        {
            next = static_cast<const char*>(memchr(curr, '\n', end - curr));
            curr = next ? next + 1 : end;
        }

        return true;
    }

    //////////////////////////////////////////////////////////////////////////////////////
    /// @fn LineParser
    ///
    /// Constructor to initialize memory mapped file and set index values
    //////////////////////////////////////////////////////////////////////////////////////
    LineParser(const std::string& filepath) 
        : mmap(filepath, boost::iostreams::mapped_file::readonly)
    {
        if (mmap.is_open())
        {
            curr = mmap.const_data();
            end = curr + mmap.size();
        }
    }

    //////////////////////////////////////////////////////////////////////////////////////
    /// @fn ~LineParser
    ///
    /// Default destructor
    //////////////////////////////////////////////////////////////////////////////////////
    ~LineParser(void) = default;
};

Please note that this class is not perfect and might not handle deviations in the file format very well, but under the assumption of a well-formed file it works nicely.
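The same `memchr` walk can be exercised against any contiguous buffer, not just a mapping; a standalone sketch (`extractSecondLines` is a hypothetical function, shown here so the logic can be tested without Boost):

```cpp
#include <cstring>
#include <string>
#include <vector>

// Standalone version of the memchr walk: collect line 2 of every
// 4-line record from a contiguous, newline-terminated buffer.
std::vector<std::string> extractSecondLines(const char* curr, const char* end)
{
    std::vector<std::string> reads;
    while (curr < end)
    {
        std::string read;
        bool ok = true;
        for (int line = 0; line < 4 && ok; ++line)
        {
            const char* nl = static_cast<const char*>(
                std::memchr(curr, '\n', end - curr));
            if (!nl) { ok = false; break; }   // incomplete final record
            if (line == 1)
                read.assign(curr, nl - curr); // the payload line
            curr = nl + 1;
        }
        if (!ok)
            break;
        reads.push_back(std::move(read));
    }
    return reads;
}
```

With the class above, the equivalent streaming loop would be `while (parser.getRead(container)) { ... }`, which avoids materialising all the payloads at once.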

nobody
zeus_masta_funk