
I have a file in the following fixed format:

<unique record identifier> <white_space> <numeric value>

e.g.

1426828011 9
1426828028 350
1426828037 25
1426828056 231
1426828058 109
1426828066 111
.
.
.

I want to write a program that reads from 'stdin' the contents of a file, and optionally accepts the absolute path of a file from the command line. The file/stdin stream is expected to be in the above format. The output should be a list of the unique ids associated with the X-largest values in the rightmost column, where X is specified by an input parameter.

For example, given the input data above and X=3, the following would be valid output:

1426828028
1426828066
1426828056

Note that the output does not need to be in any particular order. Multiple instances of the same numeric value count as distinct records of the total X. So if we have 4 records with values: 200, 200, 115, 110 and X=2 then the result must consist of the two IDs that point to 200 and 200 and no more.

Notice: take into account extremely large files.

My idea and brief implementation: Sorting by k-largest values

1st way: read the file content into a multimap, then iterate over k elements to output.
2nd way: read the file data into a vector<pair<int, int>>, then use heap sort (a priority queue).

I'm wondering which way has the better time complexity and higher performance. The time complexity of the 2nd way should be O(n log n). Is the time complexity of the 1st way O(log n)? Please tell me both the time and space complexity of the above methods and suggest any better approaches.

Besides, the input file is huge, so I'm thinking of using an external sort. But I haven't done that before. I'd appreciate it if someone could instruct me or write sample code of it for my better understanding.

Anyway, the output is not required to be sorted. We only need to print the X largest values in any order, so I'm wondering whether I need any sorting algorithm at all. The requirement to print the X largest values "in any order" seems odd to me, because it looks like we must sort in descending order before printing anyway, so I don't see why it says "in any order" as if that makes the problem easier.

My brief code:

#include <iostream>
#include <fstream>
#include <string>
#include <map>          // std::multimap / std::map
#include <vector>
#include <cassert>
#include <algorithm>
//#include "stdafx.h"

using namespace std;

std::multimap<int, int> mp;

typedef std::pair<std::string, int> mypair;

struct IntCmp {
    bool operator()(const mypair &lhs, const mypair &rhs) {
        return lhs.second > rhs.second;    // descending, so the k largest come first
    }
};

void printK(const std::map<std::string,int> &mymap, int k) {
    std::vector<mypair> myvec(mymap.begin(), mymap.end());
    assert(myvec.size() >= static_cast<std::size_t>(k));
    std::partial_sort(myvec.begin(), myvec.begin() + k, myvec.end(), IntCmp());

    for (int i = 0; i < k; ++i) {
        std::cout << i << ": " << myvec[i].first 
            << "-> " << myvec[i].second << "\n";
    }
}

void readinfo(std::istream& in)
{
    std::string ID, Value;
    while (in >> ID >> Value)              // parse "<id> <value>" pairs
        std::cout << ID << ' ' << Value << '\n';
}

int main (int argc, char **argv) {

    if (argc > 1) {     /* read from file if given as argument */
        std::ifstream fin (argv[1]);
        if (fin.is_open())
            readinfo(fin);
        else 
        {
            std::cerr << "error: Unable to open file " << argv[1] << '\n';
            return 1;
        }
    }
    else 
        // No input file has been passed in the command line.
        // Read the data from stdin (std::cin).
    {
        readinfo(std::cin);
    }

    return 0;
}

But I don't know how to split the huge file to sort and combine back together. Please tell me how to fix my code for this problem.

Hector Ta
  • @SashSinha I thought we should use a min heap, shouldn't we? Anyway, can you show me sample code to split a huge input file, sort the pieces, and merge several files into one? – Hector Ta Apr 16 '22 at 14:35

2 Answers


Maybe you could use a min-heap via std::priority_queue:

#include <cstdlib>
#include <fstream>
#include <iostream>
#include <queue>
#include <vector>

struct IdAndValue {
  std::string id;
  int value;
};

struct ValueCmp {
  bool operator()(const IdAndValue &lhs, const IdAndValue &rhs) {
    return lhs.value > rhs.value;
  }
};

void PrintTopK(std::istream &in, long k) {
  std::priority_queue<IdAndValue, std::vector<IdAndValue>, ValueCmp> largest_k;
  std::string id;
  int value;
  while (in >> id >> value) {
    if (largest_k.size() < k) {
      largest_k.push({.id = id, .value = value});
    } else {
      if (value > largest_k.top().value) {
        largest_k.pop();
        largest_k.push({.id = id, .value = value});
      }
    }
  }
  std::cout << "Top " << k << " IDs with largest values:\n";
  while (!largest_k.empty()) {
    IdAndValue id_and_value = largest_k.top();
    largest_k.pop();
    std::cout << id_and_value.id << '\n';
  }
}

int main(int argc, char **argv) {
  if (argc > 2) {  // Read from file if given as argument.
    std::ifstream fin(argv[1]);
    if (fin.is_open())
      PrintTopK(fin, std::strtol(argv[2], nullptr, 10));
    else {
      std::cerr << "Error: Unable to open file " << argv[1] << '\n';
      return 1;
    }
  } else if (argc == 2) {  // Read the data from stdin (std::cin).
    PrintTopK(std::cin, std::strtol(argv[1], nullptr, 10));
  } else {  // Guard against argv[1] being null when no arguments are given.
    std::cerr << "Usage: " << argv[0] << " [file] <k>\n";
    return 1;
  }
  return 0;
}

Usage from stdin (Ctrl + D to send EOF on unix):

./PrintTopK 3
1426828011 9
1426828028 350
1426828037 25
1426828056 231
1426828058 109
1426828066 111
Top 3 IDs with largest values:
1426828066
1426828056
1426828028

Usage when passed in a file:

$ ./PrintTopK /Users/shash/CLionProjects/PrintTopK/data.txt 3
Top 3 IDs with largest values:
1426828066
1426828056
1426828028

With data.txt:

1426828011 9
1426828028 350
1426828037 25
1426828056 231
1426828058 109
1426828066 111
Sash Sinha
  • Using `eof()` tests in a while loop is a bad code smell -- error prone and usually wrong (though you end up getting it ok with a lot of extra work checking `fail()`). Far better would be `while (in >> id >> value) {` like the OP used – Chris Dodd Apr 17 '22 at 18:56
  • Makes sense, I changed from that after reading a comment which I may have taken too strongly: *"only the final while( !(in>>ws).eof() ) is generally robust"* in this [answer](https://stackoverflow.com/a/13536879/6328256) by [Tony Delroy](https://stackoverflow.com/users/410767/tony-delroy). – Sash Sinha Apr 17 '22 at 20:42
  • That answer, though highly upvoted, appears mostly wrong IMO. – Chris Dodd Apr 17 '22 at 20:55
  • Cool, changed it back, thanks for letting me know. – Sash Sinha Apr 17 '22 at 21:07
  • Nice answer. I would like to add that for small `X` it might be [beneficial for performance](https://www.quick-bench.com/q/hSW_j6z_XHXo8kC_3FRGvxQU400) to perform a linear search on the `top_k` values. – Quxflux Apr 18 '22 at 12:58
  • @SashSinha Thanks. But if our input file is too huge for our RAM, how to handle input splitting up in that case? – Hector Ta Apr 21 '22 at 09:05

I think we can come up with a better approach that has a lower space and time complexity.

One requirement is to get the X largest values, so we only need to store X values, no more. The others are of no interest. All values are read and, if not larger than the already collected values, discarded. With that, we save a lot of memory.

Next, how to store?

If we keep the container sorted, the smallest element is always at the front. So, when we read a new value, we just need to compare it with the first element in the container: if the new value is smaller than the smallest existing value, we can discard it; but if it is bigger, we need to add it to our container and eliminate the smallest element so far.

If we use a function like std::lower_bound, it gives us the exact position at which to insert the new element, without the need for any re-sorting. The element can be inserted at exactly the correct position, and we then have a new smallest value.

To select the type of container, we think of the operations that we need to do.

  • We want to eliminate the first element without shifting all other data.
  • We want to add an element at a given position without shifting all following values to the right.

This leads us to a std::list, which will fulfil our criteria in an optimal way.

So, how will we implement the solution?

  • Define a struct to hold the data: the unique ID and the associated value
  • Add extraction >> and insertion << operators for easier IO
  • Add 2 comparison operator overloads, for std::list::sort and std::lower_bound
  • In main, get an std::istream, either for a given source file or for std::cin
  • Read the first X values and store them in the list as is. If there are only X values or fewer, we already have the solution
  • Sort the values in the list. The smallest value is now at the front of the list
  • If the std::istream still contains data, continue to read values
  • If a new value is greater than the smallest value in the list, insert it into the list at its sorted position
  • Delete the smallest value at the front

After the initial sort, values that do not beat the current smallest are rejected in constant time; when a value does qualify, std::lower_bound on a list traverses linearly, so an update costs at most O(X). The number of input values adds no memory overhead. In any case, most time will be spent reading the data from disk if the file is huge. For small files with 100,000 values or so, almost any algorithm would be fine.

Please see one of many potential solutions below.

#include <iostream>
#include <fstream>
#include <random>
#include <string>
#include <list>
#include <limits>
#include <memory>       // std::shared_ptr
#include <iterator>     // std::ostream_iterator
#include <algorithm>

const std::string fileName{ "r:\\big.txt" };

// ----------------------------------------------------------------------------
// Create a big file for Test purposes
void createBigFile() {

    if (std::ofstream ofs(fileName); ofs) {

        constexpr size_t uniqueIDStart = 1'426'828'028;
        constexpr size_t numberOfRecords = 10'000'000;
        constexpr size_t endRecords = uniqueIDStart + numberOfRecords;

        std::random_device randomDevice;
        std::mt19937 randomEngine(randomDevice());
        std::uniform_int_distribution<int> uniformDistribution(1, 10'000'000);

        for (size_t k{ uniqueIDStart }; k < endRecords; ++k) {
            ofs << k << ' ' << uniformDistribution(randomEngine) << '\n';
        }
    }
}
// ----------------------------------------------------------------------------


// Here we will store our data
struct Data {
    unsigned int ID{};
    int value{ std::numeric_limits<int>::max() };

    // Sorting operators
    bool operator < (const int& i) const { return value < i; }                  // For lower_bound
    bool operator < (const Data& other) const { return value < other.value; }   // For sort

    // Simple extractor and inserter
    friend std::istream& operator >> (std::istream& is, Data& d) { return is >> d.ID >> d.value; }
    friend std::ostream& operator << (std::ostream& os, const Data& d) { return os << d.ID << ' ' << d.value; }
};


// Whatever number of X you need for the X-largest values
constexpr size_t Rank = 50;
// We will use a list to store the X-largest data
using DList = std::list<Data>;   
using ListIter = DList::iterator;

// For faster reading we will increase the input buffer size
constexpr size_t ifStreamBufferSize = 500'000u;
static char buffer[ifStreamBufferSize];

int main(int argc, char* argv[]) {

    // If you want to create test data, then uncomment the following line
    //createBigFile();

    //We will either read from std::cin or from a file
    std::shared_ptr<std::istream> input{};
    if (argc == 2) {
        // Try to open the source file, given by command line arguments
        input.reset(new std::ifstream(argv[1]));
        input->rdbuf()->pubsetbuf(buffer, ifStreamBufferSize);
    }
    else {
        // Use std::cin for input. Handover a NoOp custom deleter. We do not want to close std::cin
        input.reset(&std::cin, [](...) {});
    }

    // If file could be opened or if std::cin is OK
    if (input) {

        // Here we will store all data
        DList dList;

        // Read the first X values as is
        size_t numberOfElementsInArray{};
        Data data;
        
        while (*input >> data) {
            if (numberOfElementsInArray < Rank) {
                dList.push_front(std::move(data));
                ++numberOfElementsInArray;
            }
            if (numberOfElementsInArray >= Rank) break;
        }
        // And do a first time (and the only sort)
        dList.sort();

        // For comparison
        int smallest{ dList.front().value };
        
        // Read the remaining data from the file
        while (*input >> data) {

            // If the latest read value is bigger than the smallest in the list, we need to add it
            if (data.value > smallest) {

                // Find the position where to insert the new element. Inserting there keeps the
                // list sorted; lower_bound may legitimately return end() when the new value is
                // the largest seen so far, and inserting at end() is exactly right in that case.
                ListIter dIter = std::lower_bound(dList.begin(), dList.end(), data.value);
                dList.insert(dIter, std::move(data));

                // We now have one value more than needed. Drop the smallest and refresh the cached minimum.
                dList.pop_front();
                smallest = dList.front().value;
            }
        }
        std::copy(dList.rbegin(), dList.rend(), std::ostream_iterator<Data>(std::cout, "\n"));
    }
    else std::cerr << "*** Error with the input file or with std::cin\n\n";
}
A M
  • Can you explain more about the handover of a NoOp custom deleter? How can I implement `(...)` in `input.reset`? – Hector Ta Apr 24 '22 at 11:20
  • This is already the implementation. The custom deleter is a lambda that does nothing. In particular, it will not close `std::cin`. The `...` is called an ellipsis and means: any arguments (a parameter pack). You must enable C++17 for your compiler to get this to work. – A M Apr 24 '22 at 11:42
  • I entered VS Studio 2015 project properties->Language but don't see C++ Language Standard to set to ISO C++17. Besides, doesn't your method need to split input files into many chunks then merge after sorting, in case our RAM cannot process all data at once? – Hector Ta Apr 24 '22 at 16:16
  • VS 2015 does not support C++17. Please upgrade (for free) to the VS 2019 community edition. And no, no split is done: only the top X elements are ever stored. – A M Apr 24 '22 at 17:38