I have a file in the following fixed format:
<unique record identifier> <white_space> <numeric value>
e.g.
1426828011 9
1426828028 350
1426828037 25
1426828056 231
1426828058 109
1426828066 111
.
.
.
I want to write a program that reads the data from 'stdin', or optionally from a file whose absolute path is given on the command line. The file/stdin stream is expected to be in the above format. The output should be the list of unique IDs associated with the X largest values in the rightmost column, where X is specified by an input parameter.
For example, given the input data above and X=3, the following would be valid output:
1426828028
1426828066
1426828056
Note that the output does not need to be in any particular order. Multiple instances of the same numeric value count as distinct records toward the total X. So if we have 4 records with values 200, 200, 115, 110 and X=2, then the result must consist of the two IDs that point to 200 and 200, and no more.
Note: the solution must take extremely large files into account.
My idea and brief implementation: sorting to find the k largest values
1st way: read the file contents into a multimap keyed by value, then iterate over its last k elements for output.
2nd way: read the file data into a vector<pair<int, int>>, then use a heap (priority queue).
I'm wondering which way has the better time complexity and performance. The time complexity of the 2nd way should be O(n log n). Is the time complexity of the 1st way O(log n), or is it O(n log n) because each of the n multimap insertions costs O(log n)? Please tell me both the time and space complexity of the above methods, and suggest any better methods.
Besides, the input file is huge, so I'm thinking of using an external sort. But I haven't done one before. I'd appreciate it if someone could explain it or write sample code of it for my better understanding.
Anyway, the output is not required to be sorted. We only need to print the X largest values in any order, so I'm wondering whether I need any sorting algorithm at all. The requirement to print the X largest values "in any order" seems odd to me, because it looks like we must sort in descending order before printing anyway; I don't see why it says "in any order", as if that makes the problem easier.
My brief code:
#include <algorithm>
#include <cassert>
#include <fstream>
#include <iostream>
#include <map>
#include <string>
#include <vector>

typedef std::pair<std::string, int> mypair;

struct IntCmp {
    // compare by value, descending, so the k largest come first
    bool operator()(const mypair &lhs, const mypair &rhs) {
        return lhs.second > rhs.second;
    }
};

void printK(const std::map<std::string, int> &mymap, std::size_t k) {
    std::vector<mypair> myvec(mymap.begin(), mymap.end());
    assert(myvec.size() >= k);
    std::partial_sort(myvec.begin(), myvec.begin() + k, myvec.end(), IntCmp());
    for (std::size_t i = 0; i < k; ++i) {
        std::cout << i << ": " << myvec[i].first
                  << " -> " << myvec[i].second << '\n';
    }
}

void readinfo(std::istream &in)
{
    std::string id;
    int value;
    while (in >> id >> value)   // for now just echo each record back
        std::cout << id << ' ' << value << '\n';
}

int main(int argc, char **argv) {
    if (argc > 1) { /* read from file if given as argument */
        std::ifstream fin(argv[1]);
        if (fin.is_open())
            readinfo(fin);
        else {
            std::cerr << "error: Unable to open file " << argv[1] << '\n';
            return 1;
        }
    }
    else {
        // No input file has been passed on the command line.
        // Read the data from stdin (std::cin).
        readinfo(std::cin);
    }
    return 0;
}
But I don't know how to split the huge file into sorted chunks and merge them back together. Please tell me how to fix my code to handle this.