1

I am a beginner at C++. I created a text file with two columns in it. However, there are around 1 million rows and there are many rows that repeat each other. I want to delete the duplicates and count how many duplicates there were making it into a third row. This is what it would look like before and after:

Before:

10 8

11 7

10 8

10 8

15 12

11 7

After:

10 8 3

11 7 2

15 12 1

I don't really know where to start can someone point me in the right direction of what I should be looking up in order to do this?

  • 1
    This sounds like a job for [`std::unordered_map`](https://en.cppreference.com/w/cpp/container/unordered_map) – 0x5453 Jul 20 '21 at 16:01
  • 2
    See `std::sort`, `std::unique` . IMHO, you should use operating system tools, like `sort` and `uniq`. – Thomas Matthews Jul 20 '21 at 16:22
  • I was wondering if we can define a struct that has 2 integers as the key, and then use std::map along with key_compare() or operator<() in this case ? – Job_September_2020 Jul 20 '21 at 17:15
  • This old stackoverflow answer may be helpful to you : https://stackoverflow.com/questions/52639730/writing-a-custom-comparator-for-a-c-map-which-has-a-custom-defined-key – Job_September_2020 Jul 20 '21 at 17:31

4 Answers4

0

You can create std::map<std::pair<int, int>, int>, and after each insertion check if the given pair is contained in the map. If pair is contained just increment number of duplicates, otherwise emplace it in the map.

Something like this:

#include <iostream>
#include <map>

int main(int argc, char* argv[]) {
  std::map<std::pair<int, int>, int> rows;

  int num1;
  int num2;
  while (std::cin >> num1 >> num2) {
    auto pair = std::make_pair(num1, num2);
    if (rows.find(pair) != rows.end())
      ++rows[pair];
    else
      rows.emplace(pair, 1);
  }
}
Dzemo997
  • 328
  • 2
  • 5
  • I am curious about a more complex scenario. Suppose the key includes 3 integers (instead of 2 integers as in this case), how can we handle the comparison operation of 3 integers since "std::pair" may not work for 3 integers ? – Job_September_2020 Jul 20 '21 at 17:23
  • In that case instead of std::pair use some data structure, maybe std::vector will be good, or std::set if all values in the row are different. – Dzemo997 Jul 20 '21 at 17:29
  • 2
    I believe it should be `++rows[pair]` instead of `rows[pair] = 1`. – ivan.ukr Jul 20 '21 at 19:07
  • Why not directly `++rows[pair]` without the `if else`, since `rows[pair]` will construct non-existent elements by default? – xskxzr Jul 21 '21 at 08:30
  • Because value is not 0, it has some unknown value. Because of that we need to initialize value it key does not exist and after that just increment it. – Dzemo997 Jul 21 '21 at 16:03
0

This can be done with a std::priority_queue, which automatically sorts the entries. With the data sorted like this, one only has to count the number of subsequent identical entries:

#include <queue> 
#include <iostream>
#include <vector>
#include <utility> // for std::pair

int main() {
  std::priority_queue<std::pair<int,int>> mydat;
  mydat.push(std::make_pair(10,8));
  mydat.push(std::make_pair(11,7));
  mydat.push(std::make_pair(10,8));
  mydat.push(std::make_pair(10,8));
  mydat.push(std::make_pair(15,12));
  mydat.push(std::make_pair(11,7));

  std::vector<std::vector<int>> out;
  std::pair<int,int> previous;
  int counter;
  
  while(!mydat.empty()) {
    counter = 1;
    previous = mydat.top();
    mydat.pop(); // move on to next entry
    while(previous == mydat.top() && !mydat.empty()) {
      previous = mydat.top();
      mydat.pop();
      counter++;
    }
    out.push_back({previous.first, previous.second, counter});
  }
  for(int i = 0; i < out.size(); ++i) {
    std::cout << out[i][0] << " " << out[i][1] << " " << out[i][2] << std::endl;
  }
}
   

godbolt demo

Output:

15 12 1 
11 7 2 
10 8 3
RHertel
  • 23,412
  • 5
  • 38
  • 64
0
#include <string>
#include <fstream>
#include <unordered_map>

using namespace std;

int main()
{
    string line;
    unordered_map<string, int> count_map;

    ifstream src("input.txt");
    if (!src.is_open())
    {
        return -1;
    }

    while (getline(src, line))
    {
        if (line.empty())
            continue;

        count_map[line]++;
    }    
    src.close();

    ofstream dst("output.txt");
    if (!dst.is_open())
    {
        return -2;
    }

    for (auto & iter : count_map)
    {
        dst << iter.first << " " << iter.second << endl;
    }
    dst.close();

    return 0;
}
secuman
  • 539
  • 4
  • 12
0
#include <iostream>
#include <fstream>
#include <string>
#include <map>
#include <set>

using namespace std;

int main() {
    ifstream src("input.txt");
    if (!src.is_open()) {
        return -1;
    }
    // store each line, filter out all of duplicated strings
    set<string> container;
    // key is to maintain the order of lines, value is a pair<K, V>
    // K is the itor pointed to the string in the container
    // V is the counts of the string
    map<int, std::pair<set<string>::iterator, int>> mp;
    // key is the pointer which points to the string in the container
    // value is the index of string in the file
    map<const string *, int> index; 
    string line;
    int idx = 0; // index of the string in the file

    while (getline(src, line)) {
        if (line.empty()) {
            continue;
        }
        auto res = container.insert(line);
        if (res.second) {
            index[&(*res.first)] = idx;
            mp[idx] = {res.first, 1};
            idx++;
        } else {
            mp[index[&(*res.first)]].second += 1;
        }
    }
    src.close();

    ofstream dst("output.txt");
    if (!dst.is_open()) {
        return -2;
    }

    for (const auto & iter : mp) {
        dst << *iter.second.first << " " << iter.second.second << endl;
    }
    dst.close();

    return 0;
}

BTW, Redis can solve this problem easily if you are allowed to use it.

Yves
  • 11,597
  • 17
  • 83
  • 180