0

Good afternoon.

I ran into some problems. I can solve it, but I have little experience in programming, and it seems to me that there is a more beautiful and rational solutions to this problem.

The problem is as follows. Given a set of text files with a total size of more than one hundred megabytes. The number of files from 2 to N. Files contain sorted unique numbers (for example, IDs). It wanted to merge all the numbers in one output file. Inside the resulting file numbers also need to be sorted.

I'm going to solve this problem as follows: Open all files. Take out the first number of each file. Put them in a container (eg, a vector). Find the smallest number within the container. Put the minimal number in the output file. In his place, put the following number from the file, from which was taken the minimal number. It seems that this method is called "external merge sort".

Tell me, please, is there a smarter way to solve this problem?

3 Answers3

3

External mergesort was created for this exact problem. The complexity of this sort is N * log(number_of_files).

For simplicity in your code you can store the filestreams along with the numbers in a priority queue.

Completely untested, but maybe helpful code:

std::vector<ifstream> file_streams(stream_count);  
// open streams.
using int_and_stream = std::pair<int, std::ifstream&>;
using cont = std::vector<int_and_stream>;
std::priority_queue<int_and_stream, cont, pair_comparer> queue;

for(auto& stream: file_streams){
   int id;
   stream >> id;
   queue.push(std::make_pair(id, stream));
}

while(!queue.empty()){
   auto smallest = queue.top();
   outstream << smallest.first;
   int id;
   if(smallest.second >> id){ // file ended?
      queue.push(std::make_pair(id, stream));
   }
}              

For the pair_comparer you can look here

Community
  • 1
  • 1
Captain Giraffe
  • 14,407
  • 6
  • 39
  • 67
  • It seems that all the improvements are reduced to using a priority queue. I think that's what I'll do. Many thanks to all of you. – Merovingean Jul 09 '15 at 15:50
  • @Captain Giraffe There was little difficulty. Here: queue.push(std::make_pair(id, stream)); An error number С2280. Forbidden the copy constructor for ifstream. I tried using std::reference_wrapper, but did not succeed. – Merovingean Jul 09 '15 at 23:58
  • @Merovingean Try `new int_and_stream(id, stream)` or ask a new question. – Captain Giraffe Jul 10 '15 at 11:29
0

Your approach is good.

But you can have a sorted vector of number/file pairs, so you would spend less time finding the smallest number, as after taking the smallest entry out, you can read next value and insert it back into array using more efficient algorithm, than linear sort. This is more efficient when you have large number of input files.

But assuming that cost of I/O is much more expensive, than number comparison, the approach is OK.

Valeri Atamaniouk
  • 5,125
  • 2
  • 16
  • 18
  • Using a `std::set>` would be easiest, where `Number` can be an integral type or floating type depending on the data. – R Sahu Jul 09 '15 at 15:09
0

A better approach would be to store the current head of each file in a priority queue. Then in each step you take the top of the priority queue (noting the input file this item is taken from), write it to the output file, and push the next item of that input file into the priority queue.

Serge Rogatch
  • 13,865
  • 7
  • 86
  • 158