Keep track of highest 5 numbers during file input

Question

So lets say i have a struct

struct largestOwners
{
    string name;
    double amountOwned;
};

And i am reading it in from a file using ifstream with 300 names and amounts.

How can i go about keeping track of the highest 5 numbers during input? So i dont have to sort after, but rather track it during ifstream input.

My goal is to keep track of the 5 highest amounts during input so i can easily print it out later. And save time/processing rather than to do it in the future

I get i can store this in an array or another struct, but whats a good algorithm for tracking this during ifstream input to the struct?

Lets say the text file looks like this, when im reading it in.

4025025 Tony
66636 John
25 Tom
23693296 Brady
363 Bradley
6200 Tim

Thanks!

Have you thought about how you would do it if you were using plain numbers? — Beta, Nov 13 '15 at 07:43
Are you looking for an algorithm or for an algorithm plus an implementation? What have you tried yourself? — moooeeeep, Nov 13 '15 at 07:47
I am looking how to keep track of the highest values during ifstream readin to the struct rather than sort later. — soniccool, Nov 13 '15 at 07:47

Rushil Paul · Answer 1 · 2015-11-13T08:51:48.930

4

To keep track of the highest 5 numbers in a stream of incoming numbers, you could use a min-heap of size 5 (C++ STL set can be used as a min-heap).

First fill the min-heap with the first 5 numbers. After that, for each incoming element, compare that with the smallest of the largest 5 numbers that you have (root of the min-heap). If the current number is smaller than that, do nothing, otherwise remove the 5th largest (pop from min-heap) and insert the current number to the min-heap.

Deleting and inserting in the min-heap will take O(log n) time.

For example, consider the following stream of numbers:

1 2 5 6 3 4 0 10 3

The min-heap will have 1 2 3 5 6 initially.

On encountering 4, 1 gets removed and 4 gets inserted. Min heap now looks like this: 2 3 4 5 6

On encountering 0, nothing happens.

On encountering 10, 2 gets removed and 10 gets inserted. Min heap now looks like this: 3 4 5 6 10

On encountering 3, nothing happens.

So your final set of 5 largest elements are contained in the heap (3 4 5 6 10)

You can even tweak this to keep track of the k highest elements in an incoming stream of numbers. Just change the size of the min-heap to k.

edited Nov 13 '15 at 08:51

answered Nov 13 '15 at 07:52

Rushil Paul

2,010
4
26
39

2

+1 for the good approach. I guess abstractly it's actually the fastest way. Nevertheless, for 5 elements a heap is overkill, a simple unsorted for loop would do better in this case. – Stefan Nov 13 '15 at 08:32
@Stefan Yes but I wanted the OP to know that a heap can be used for a similar problem with a larger value of k. – Rushil Paul Nov 13 '15 at 08:50
1

Therefore the upvote. On the other hand, I've seen to many people wasting time on elaborating complex solutions being then second best to the simple approaches. I wanted to point that out, as the op (or others coming here to find find inspiration) might not know the difference between theoretical and practical efficiency. – Stefan Nov 13 '15 at 09:07
1

Actually I think this approach is even easier to implement than an approach without using the sorted set, thus having to implement erasing the minimum element oneself. – moooeeeep Nov 13 '15 at 09:30
@Stefan No, an unsorted for loop is worse. That requires 5 comparisons per element in the big array, while (assuming the array is unsorted), the heap approach requires `n + O(log(n))` comparisons. A sorted for loop would be approximately similar though. The heap is most efficient until the array is big and you want a fixed percentage of the array. At that point this is `O(n log(n))` while quick select is `O(n)`. (Though with worse constants - which is why the heap wins for just a few elements.) – btilly Nov 13 '15 at 16:18

score 2 · Answer 2 · edited Nov 13 '15 at 18:59

While reading the file, keep a sorted list of the 5 largest numbers seen (and their owners). Whenever you read a value higher than the lowest of the 5, remove the lowest and insert the new number in your sorted list.

List list can be stored in an array or in any other data structure that has an order and where you can implement a sort and insert. (Or where this is already implemented)

Instead of sorting the list you can also simply go through the 5 entries every time you read a new one (should not be too bad, because 5 entries is a very small number)

moooeeeep · Answer 3 · 2015-11-15T20:28:06.793

You can use the standard library function std::nth_element() for this.

It should be fairly easy to implement a comparison function (or overload the comparison operator) for your struct. Then you'd just parse the file into a vector of those and be done with it. The algorithm uses a partial sort, linear time on average.

Here's the example given on the documentation site I've linked below:

#include <iostream>
#include <vector>
#include <algorithm>
#include <functional>

int main()
{
    std::vector<int> v{5, 6, 4, 3, 2, 6, 7, 9, 3};

    std::nth_element(v.begin(), v.begin() + v.size()/2, v.end());
    std::cout << "The median is " << v[v.size()/2] << '\n';

    std::nth_element(v.begin(), v.begin()+1, v.end(), std::greater<int>());
    std::cout << "The second largest element is " << v[1] << '\n';
}

For reference:

http://en.cppreference.com/w/cpp/algorithm/nth_element

Out of curiosity, I have implemented some approaches:

#include <algorithm>
#include <functional>
#include <queue>
#include <set>
#include <vector>

std::vector<int> filter_nth_element(std::vector<int> v, int n) {
    auto target = v.begin()+n;
    std::nth_element(v.begin(), target, v.end(), std::greater<int>());
    std::vector<int> result(v.begin(), target);
    return result;
}

std::vector<int> filter_pqueue(std::vector<int> v, int n) {
    std::vector<int> result;
    std::priority_queue<int, std::vector<int>, std::greater<int>> q;
    for (auto i: v) {
        q.push(i);
        if (q.size() > n) {
            q.pop();
        }
    }
    while (!q.empty()) {
        result.push_back(q.top());
        q.pop();
    }
    return result;
}

std::vector<int> filter_set(std::vector<int> v, int n) {
    std::set<int> s;
    for (auto i: v) {
        s.insert(i);
        if (s.size() > n) {
            s.erase(s.begin());
        }
    }
    return std::vector<int>(s.begin(), s.end());
}

std::vector<int> filter_deque(std::vector<int> v, int n) {
    std::deque<int> q;
    for (auto i: v) {
        q.push_back(i);
        if (q.size() > n) {
            q.erase(std::min_element(q.begin(), q.end()));
        }
    }
    return std::vector<int>(q.begin(), q.end());
}

std::vector<int> filter_vector(std::vector<int> v, int n) {
    std::vector<int> q;
    for (auto i: v) {
        q.push_back(i);
        if (q.size() > n) {
            q.erase(std::min_element(q.begin(), q.end()));
        }
    }
    return q;
}

And I have made up some tests:

#include <random>
#include <iostream>
#include <chrono>

std::vector<int> filter_nth_element(std::vector<int> v, int n);
std::vector<int> filter_pqueue(std::vector<int> v, int n);
std::vector<int> filter_set(std::vector<int> v, int n);
std::vector<int> filter_deque(std::vector<int> v, int n);
std::vector<int> filter_vector(std::vector<int> v, int n);

struct stopclock {
    typedef std::chrono::high_resolution_clock high_resolution_clock;
    std::chrono::time_point<high_resolution_clock> start, end;
    stopclock() : start(high_resolution_clock::now()) {}
    ~stopclock() {
        using namespace std::chrono;
        auto elapsed = high_resolution_clock::now() - start;
        auto elapsed_ms = duration_cast<milliseconds>(elapsed);
        std::cout << elapsed_ms.count() << " ";
    }
};

int main() {
    // randomly initialize input array
    std::random_device rd;
    std::mt19937 gen(rd());
    std::uniform_int_distribution<> dist;
    std::vector<int> v(10000000);
    for (auto &i: v)
        i = dist(gen);
    // run tests
    for (std::vector<int>::size_type x = 5; x <= 100; x+=5) {
        // keep this many values
        std::cout << x << " ";
        {
            stopclock t;
            auto result = filter_nth_element(v, x);
        }
        {
            stopclock t;
            auto result = filter_pqueue(v, x);
        }
        {
            stopclock t;
            auto result = filter_set(v, x);
        }
        {
            stopclock t;
            auto result = filter_deque(v, x);
        }
        {
            stopclock t;
            auto result = filter_vector(v, x);
        }
        std::cout << "\n";
    }
}

And I found it quite interesting to see the relative performance of these approaches (compiled with -O3 - I think I have to think a bit about these results):

score 0 · Answer 4 · answered Nov 13 '15 at 07:49

A binary search tree could be a suitable data structure for this problem. Maybe you can find a suitable Tree class in STL or Boost or so (try to look for that). Otherwise simply use a struct if you insist.

The struct would be like that:

struct tnode {        /* the tree node: */
    char *word;           /* points to the text */
    int count;            /* number of occurrences */
    struct tnode *left;   /* left child */
    struct tnode *right;  /* right child */
};

Taken from The C Programming Language, chapter 6.5 - Self-referential Structures. Just adapt it to your needs.

Though, I think if you want to program in C++ properly, try to create a Tree data structure (class) or try to use an existing one.

Considering that you only have 300 entries, that should do it. In theory when the input data is random it is supposed to be efficient. But that is theory and does not really play a role in your case. I think it is a good solution.

score 0 · Answer 5 · answered Nov 13 '15 at 07:49

0

You can use sorted buffer of 5 elements and on each step if item is higher than lowest item of the buffer, put item in buffer and evict lowest

answered Nov 13 '15 at 07:49

Nyavro

8,806
2
26
33

mksteve · Answer 6 · 2015-11-13T08:03:19.297

Use a map of elements

First Create a class

  class Data {
    public:
       std::string name;
       int         number;
};


 typedef std::map< int, Data > DataItems;
 DataItems  largest;

If the size of largest is < 5, then you haven't read five elements.

 if( largest.size() < 5 ) {
     largest[ dt.number] = dt;
 } else {

Otherwise - if it is larger than the smallest of the largest five, then the largest five has changed.

     DataItems::iterator it = largest.begin(); // lowest current item.
     if( it->second.number < dt.number ) { // is this item bigger? - yes
          largest[ dt.number ] = dt;       // add it (largest.size() == 6)
          largest.erase( largest.begin() );// remove smallest item
     }
 }

oo_miguel · Answer 7 · 2015-11-13T08:45:20.060

You can use a set to keep track of the highest values. If you want to track non-unique numbers use a multiset instead:

vector<int> nums{10,11,12,1,2,3,4,5,6,7,8,9}; //example data

int k=5;       // number of values to track
set<int> s;    // this set will hold the result

for(auto a: nums)
{
    if(s.size()<k)s.insert(a); 
    else if(a>*s.begin())
    {
       s.erase(s.begin());
       s.insert(a);
    }
}

Of course you will have to provide a custom comparison function for your struct.

score 0 · Answer 8 · answered Aug 29 '22 at 18:06

0

I'm surprised nobody has mentioned priority queue data-structure that's made exactly for this https://en.cppreference.com/w/cpp/container/priority_queue

answered Aug 29 '22 at 18:06

Soup Endless

439
3
10

Keep track of highest 5 numbers during file input

8 Answers8

Linked