2

I have a very large text file that has over 11 million entries/lines. Each line has 35 values in it, each value is separated/delimited by a "|".

For each line that I am reading in, I am creating an object, "Record". I am storing them in a vector of Records because I need to be able to sort them based on the values in a given field. (Please suggest better approach if there is one)

I know how to override the istream>> operator, but I have never had to do it for an object this large, and I'm not sure what the best approach is. I tried to create tokens before each delimiter IE:

using namespace std; 

inline istream& operator>>(istream& is, Record& r) {
    string line_of_text;
    string token;
    char delim = '|';

    is >> temp;

    token = line_of_text.substr(0, line_of_text.find(delim));
    r.firstField = token;
    
    // so on for each field in Record

    return is;
}

but this is very impractical and inefficient.

Is there a reasonable way of doing this for such a large object? What is the best way to parse text like this without wasting so much memory?

Example line of input:

xx|0000|0| 0.00| 3.00|111|111| 5.70| 136000.00| 620.23| 80.00| 47.00| 0.000|FIX |P|C| 80.00|Full|SF|1.|P|convention|ME| 3| | |UnReported |WFHM |2 |N| |1|0|0|0|0|0| 126162.03| 0.00| 0.00| 0.00| 0.00| 0.00| 0.00| 0.00

I also tried just doing

inline istream& operator>>(istream& is, Record& r) {
    return is >> r.fieldOne >> r.fieldTwo; //....etc
}

but this does not work due to the fact that many fields are not separated with a space but just a '|', is there a graceful way to have >> skip the "|" as it does with blank spaces? Keep in mind there is a possibility for fields to be empty.

BadProgrammer
  • 99
  • 2
  • 8
  • 2
    You could use good old-fashioned [`strtok`](https://stackoverflow.com/questions/3889992/how-does-strtok-split-the-string-into-tokens-in-c) – Paul Sanders Nov 16 '20 at 08:57
  • 3
    Have you considered using a database instead (like sqlite)? You'll get all sorts of sorting and filtering "for free". – Mat Nov 16 '20 at 09:02
  • I was asked not to modify the original .txt file (as future use cases will be of the same format) and to solve this using only C++ (They don't want to add more dependencies and wanted it to be "simple") – BadProgrammer Nov 16 '20 at 09:05
  • 2
    You probably want `std::getline` instead of `operator >>`, it can read up to a delimiter. But you will need to provide conversion from string to number if members are ints/doubles. You can probably provide some helper functions to make it easier to handle. – Yksisarvinen Nov 16 '20 at 09:05
  • 3
    You can create a facet that treats `|` as whitespace. Fun, but I don't know if it is the fastest or most practical way. https://stackoverflow.com/a/14617422/920069 It looks like in your sample there's an empty field, so this might not be practical since it would just skip it and throw everything off. – Retired Ninja Nov 16 '20 at 09:07
  • 3
    I would also count the lines of the file before doing any parsing in order to set the size of the container (you said "std::vector") and avoid reallocations or copies. You may opt also for a linked list instead.. UPDATE: I see you want to sort it...go for a map or linked list , it should help data insertion. – Stefano Buora Nov 16 '20 at 09:07
  • If performance is a concern i would advice you to use memory mapped IO. Regarding the storage: Column-major is usually preferred for datasets this large or timeseries data, i.e one vector for each feature. This means that if you calculate your sort-order the relevant column is contiguous in memory. However you have to decide whether that is worth the effort. IMO 11 million entries is still very managable and as such row-major might be alot simpler for now. – Sebastian Hoffmann Nov 16 '20 at 09:47
  • Also notice that we have `std::from_chars` now (https://en.cppreference.com/w/cpp/utility/from_chars) – Sebastian Hoffmann Nov 16 '20 at 09:49
  • IMHO: are you sure you need to parse all the data and all the fields? could you just read the lines (as strings), store as they are, and only use the field of your interest for the sorting? I don't know the problem you are going to solve, in case... Again: Hope you are doing your development and tests on a subset of the whole archive, leaving the real execution on the huge archive only for later when the code is "almost" ready...it can save your day ;) – Stefano Buora Nov 16 '20 at 09:55

1 Answers1

3

I really wanted to find a use for pointer-to-member syntax for once, so...

You can use pointer-to-member syntax with a set of overloaded helpers to let compiler choose the correct convertor:

struct Record
{
    int x;
    std::string y;
    double z;
    
    void readInput(std::istream& in, int Record::*var)
    {
        std::string input;
        std::getline(in, input, '|');
        this->*var = std::stoi(input);
    }
    
    void readInput(std::istream& in, double Record::*var)
    {
        std::string input;
        std::getline(in, input, '|');
        this->*var = std::stod(input);
    }
    
    void readInput(std::istream& in, std::string Record::*var)
    {
        std::getline(in, this->*var, '|');
    }
};

With this, the operator >> would look like this:

std::istream& operator>>(std::istream& in, Record& r)
{
    r.readInput(in, &Record::x);
    r.readInput(in, &Record::y);
    r.readInput(in, &Record::z);
    //no need to handle last value as special case as long as stream ends there and you don't care that it will be in fail() state afterwards
    return in;
}

See it online


It would be possible to just provide free functions, which take a reference instead of pointer to member, e.g.:

void readInput(std::istream& in, int& var)
{
    std::string input;
    std::getline(in, input, '|');
    var = std::stoi(input);
}

with usage in operator >> like this:

readInput(in, r.x);

The core difference between these two approaches is whether you want it to be usable only with Record or you will always want to read ints delimited by | from istreams.

Yksisarvinen
  • 18,008
  • 2
  • 24
  • 52
  • This worked with a surprising level of efficiency. Thank you for the very clear and concise explanation, much appreciated! – BadProgrammer Nov 16 '20 at 10:13