
I've found answers to similar problems, but none of them exactly describes my problem, so at the risk of being down-voted to hell I'm wondering whether there is a standard method to solve it. Also, there's a chance that I'm asking the wrong question: maybe the problem can be solved more efficiently another way.

So here's some background: I'm looping through a list of particles. Each particle has a list of its neighbouring particles. Now I need to create a list of unique particle pairs of mutual neighbours. Each particle can be identified by an integer.

Should I just build a list of all the pairs, including duplicates, and then use some kind of sort & compare step to eliminate the duplicates, or should I try to avoid adding duplicates to my list in the first place?

Performance is really important to me. I guess most of the loops can be vectorized and threaded. On average each particle has around 15 neighbours, and I expect there will be at most 1e6 particles.

I do have some ideas, but I'm not an experienced coder, and I don't want to waste a week benchmarking every single method in different situations just to find out that there's already a standard method for my problem. Any suggestions?

BTW: I'm using C.

Some pseudo-code

for (i = 0; i < nparticles; i++) {
    particle = particles[i];            /* particles[] just holds the "index" of each particle */
                                        /* each particle has a neighbour list, neighlist[i]    */
    for (n = 0; n < nneigh[i]; n++) {   /* loop over all the neighbours of particle i          */
        k = neighlist[i][n];            /* k is the index of one neighbour of particle i       */
        /* if the pair (i,k) or (k,i) is not already in the pair list, add it; otherwise skip */
    }
}
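
If the neighbour lists are symmetric (k appears in neighlist[i] exactly when i appears in neighlist[k]), the "lower index first" idea from the comments below makes the duplicate check unnecessary: emit a pair only when i < k, and every mutual pair is produced exactly once. A minimal sketch, using hypothetical nneigh[] and pairs[][] arrays:

    size_t npairs = 0;
    for (int i = 0; i < nparticles; i++) {
        for (int n = 0; n < nneigh[i]; n++) {    /* nneigh[i]: length of i's neighbour list */
            int k = neighlist[i][n];
            if (i < k) {                         /* keep only the (low, high) orientation   */
                pairs[npairs][0] = i;            /* pairs[][2]: output list of unique pairs */
                pairs[npairs][1] = k;
                npairs++;
            }
        }
    }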
OD IUM
  • Could you please post some code that does what you want (regardless of performance) so we can better understand what you want? – aaaaaa123456789 Dec 01 '18 at 16:35
  • It's a trivial task if you're willing to use a Linux/Unix shell command. Do you really want to code that in C? In shell scripting it's just one line, `sort -u`, that is, sort unique on the input file. – Krassi Em Dec 01 '18 at 16:40
  • You can start by always creating your pairs with the lower ID in the first entry. Then you'd only have to check for `(i,k)` and not `(k,i)`. – 1201ProgramAlarm Dec 01 '18 at 16:44
  • yeah I know, I can do that easily in the shell by sorting twice. I can also do it easily in Python. But it has to be efficient: this is going to be the dominant part of a code simulating hundreds of millions of particles on an HPC system – OD IUM Dec 01 '18 at 16:44
  • that's a good start @1201ProgramAlarm. Still, I'm wondering whether I should sort "on the fly" and reallocate my array, or eliminate duplicates at the end – OD IUM Dec 01 '18 at 16:46
  • Just number the (ordered) pairs by combining the two numbers into a (64bit) large key, and use that as a key for a hash table. 30 minutes of work. – wildplasser Dec 01 '18 at 16:50
  • are you implementing DFS or Dijkstra? – sleepyhead Dec 01 '18 at 17:05
  • The data structure you described is a graph. You're attempting to make an edge list for the graph. There are representations of graphs that efficiently maintain the edge list as the graph is built. If you use one of these, what you're trying to do will take zero time. – Gene Dec 01 '18 at 19:35
  • @Gene Thanks for the hint. Could you be more specific, please? All I found was 1. the "adjacency matrix" --> that would be perfect, but its memory complexity rules it out, and 2. the adjacency list --> that doesn't solve my problem. Is there something else? – OD IUM Dec 02 '18 at 15:35

1 Answer


Sorting the elements on every iteration is not a good idea, since a comparison sort has O(n log n) complexity.

The next best thing would be to store the items in a search tree, better yet a binary search tree, and better still a self-balancing binary search tree; you can find implementations on GitHub.
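
For example, here is a rough sketch built on POSIX tsearch() from <search.h> (typically implemented as a balanced red-black tree, e.g. in glibc), keyed on the two particle indices packed into a single 64-bit value; error handling is omitted for brevity, and the packing assumes indices fit in 32 bits:

    #include <search.h>    /* POSIX tsearch() */
    #include <stdint.h>
    #include <stdlib.h>

    static int cmp_u64(const void *a, const void *b)
    {
        uint64_t x = *(const uint64_t *)a, y = *(const uint64_t *)b;
        return (x > y) - (x < y);
    }

    /* Returns 1 if the pair (i,k) was new (and stores it), 0 if it was already in the tree. */
    static int add_pair(void **root, uint32_t i, uint32_t k)
    {
        uint64_t lo = i < k ? i : k, hi = i < k ? k : i;
        uint64_t *key = malloc(sizeof *key);
        *key = (lo << 32) | hi;                  /* one unique 64-bit id per unordered pair */
        uint64_t **node = tsearch(key, root, cmp_u64);
        if (*node != key) {                      /* an equal key was already stored */
            free(key);
            return 0;
        }
        return 1;
    }

Each insertion or lookup is O(log n) in the number of pairs stored so far.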

An even better solution would give an access time of O(1). You can achieve this in two different ways. One is a simple identity array, where each slot either stores (say) a pointer to the item with that id, or a flag marking that id as empty. This is very fast but wasteful: you'll need O(N) memory.
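
A sketch of that direct-addressing idea (hypothetical names; one presence bit per unordered pair, so memory grows as O(nmax²) and is only practical for a moderate number of particle ids):

    #include <stdint.h>
    #include <stdlib.h>

    static uint8_t *seen;    /* one bit per (lo, hi) pair        */
    static size_t   nmax;    /* number of distinct particle ids  */

    static int pairflags_init(size_t n)
    {
        nmax = n;
        seen = calloc((n * n + 7) / 8, 1);       /* all bits start at 0 (= "not seen") */
        return seen != NULL;
    }

    /* Marks the pair (i,k) as seen; returns 1 if it was already recorded, 0 if it is new. */
    static int pair_seen_and_mark(uint32_t i, uint32_t k)
    {
        uint32_t lo = i < k ? i : k, hi = i < k ? k : i;
        size_t   idx  = (size_t)lo * nmax + hi;  /* unique slot per unordered pair */
        uint8_t  mask = (uint8_t)(1u << (idx & 7));
        if (seen[idx >> 3] & mask)
            return 1;
        seen[idx >> 3] |= mask;
        return 0;
    }

With 1e6 particle ids this is on the order of 10^12 bits (over 100 GB), which is exactly the "wasteful" trade-off mentioned above.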

The best solution, in my opinion, would be to use a set or a hash-map, which are basically the same thing, since a set can be implemented with a hash-map.
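
A minimal sketch of that idea in plain C, along the lines of wildplasser's comment above: the two indices are packed into a single 64-bit key and stored in an open-addressing hash set (the hash constant and sizing are arbitrary choices here, and the table must stay well below full):

    #include <stdint.h>
    #include <stdlib.h>

    static uint64_t *table;        /* 0 marks an empty slot                          */
    static size_t    table_size;   /* power of two, well above the expected pair count */

    static int pairset_init(size_t size_pow2)
    {
        table_size = size_pow2;
        table = calloc(table_size, sizeof *table);
        return table != NULL;
    }

    /* Inserts the pair (i,k); returns 1 if it was new, 0 if it was already present. */
    static int pairset_insert(uint32_t i, uint32_t k)
    {
        uint64_t lo = i < k ? i : k, hi = i < k ? k : i;
        uint64_t key = ((lo + 1) << 32) | (hi + 1);      /* +1 keeps the key nonzero */
        uint64_t h = key * 0x9E3779B97F4A7C15ull;        /* cheap multiplicative mix */
        h ^= h >> 32;
        for (size_t s = h & (table_size - 1); ; s = (s + 1) & (table_size - 1)) {
            if (table[s] == key)                         /* duplicate pair           */
                return 0;
            if (table[s] == 0) {                         /* empty slot: claim it     */
                table[s] = key;
                return 1;
            }
        }
    }

With ~1e6 particles and ~15 neighbours each there are roughly 7.5e6 unique pairs, so a table of 2^24 entries (about 128 MB) keeps the load factor under 50%.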

Here is a GitHub project with a C hash-map implementation, and a Stack Overflow answer to a similar question.

sleepyhead
  • Strange idea to trade sorting for a search tree, for which the complexity of the basic operations is O(log n), hence also exhibiting asymptotic complexity O(n log n) to handle the n elements, but with a much worse hidden constant. – Yves Daoust Dec 01 '18 at 19:39
  • @YvesDaoust quote: "On average each particle has around 15 neighbours", O(n * log 15)=O(n) total or O(1) amortized per lookup – sleepyhead Dec 01 '18 at 20:13
  • The 15 elements are static, there is no benefit of using a binary tree. –  Dec 01 '18 at 21:04
  • Yep, if there are only 15 and they are static you may be better off with an array, but he said "on average", so what size should the array be? Or maybe a linked list, but what if there are nodes with O(n) neighbours? That's why I offered a tree as a partial solution. I don't know all the details of the problem; in any case a hash table will solve it best. – sleepyhead Dec 01 '18 at 21:56
  • There is no problem storing all lists in a single array, with no gaps. –  Dec 02 '18 at 09:08