Suppose I have a very large data.frame/collection with two fields, an id and a proxy id:
1 a
2 a
1 b
3 b
1 c
3 c
4 d
Now I'd like to count the matches each id has with every other id, i.e. how many proxies two ids share:
1 2 1 #id1 id2 count
1 3 2
OK, this works with some Python itertools.combinations and lookups, but it feels cumbersome. Is there a more appropriate, simple, and fast approach or technology?
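For concreteness, the itertools.combinations idea can be sketched roughly as below on the sample data. This is only an illustration, not my exact code: the names `rows`, `count_shared_proxies` and `ids_by_proxy` are made up for the example, and it assumes everything fits in memory and that each (id, proxy) row occurs only once.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Sample data from above: one (id, proxy) tuple per row.
rows = [(1, "a"), (2, "a"), (1, "b"), (3, "b"), (1, "c"), (3, "c"), (4, "d")]

def count_shared_proxies(rows):
    """Count, for every unordered pair of ids, how many proxies they share."""
    # Lookup table: proxy -> set of ids that use this proxy.
    ids_by_proxy = defaultdict(set)
    for id_, proxy in rows:
        ids_by_proxy[proxy].add(id_)

    counts = Counter()
    for ids in ids_by_proxy.values():
        # Every pair of ids under the same proxy is one match.
        # Sorting keeps each pair in a fixed (smaller, larger) order.
        counts.update(combinations(sorted(ids), 2))
    return counts

pair_counts = count_shared_proxies(rows)
for (id1, id2), n in sorted(pair_counts.items()):
    print(id1, id2, n)
# prints:
# 1 2 1
# 1 3 2
```

The number of pairs grows quadratically with the number of ids per proxy, which is part of why this feels heavy on millions of rows.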
My approach (appended later):
I first filtered for the ids that appear more than x times, because I have millions of them.
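A minimal sketch of that filtering step, assuming the data sits in a pandas DataFrame `df` with the columns `id` and `proxy`. The helper name `filter_frequent_ids` and the threshold value are placeholders for illustration:

```python
import pandas as pd

# Sample data from above; the real data has millions of rows.
df = pd.DataFrame({
    "id": [1, 2, 1, 3, 1, 3, 4],
    "proxy": ["a", "a", "b", "b", "c", "c", "d"],
})

x = 1  # placeholder threshold: keep ids that appear more than x times

def filter_frequent_ids(frame, min_count):
    """Return a one-column frame with the ids occurring more than min_count times."""
    counts = frame["id"].value_counts()
    frequent = counts[counts > min_count].index
    # One row per surviving id, sorted for a stable order.
    return pd.DataFrame({"id": sorted(frequent)})

df_id_filtered = filter_frequent_ids(df, x)
print(df_id_filtered["id"].tolist())  # [1, 3]
```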
import itertools
from collections import Counter

import pandas as pd

def matchings(id):
    # mapping is the mongodb collection
    match = mapping.find({'id': id})
    valid_proxies = [doc['proxy'] for doc in match]
    # all ids that use at least one of these proxies
    other_ids = [doc['id'] for doc in mapping.find({'proxy': {'$in': valid_proxies}})]
    c = Counter([(id, id2) for id2 in other_ids if id2 != id])
    # possible filter
    # c_filtered = {k: v for k, v in c.items() if v > 3}
    # some stats
    # s1 = [id, len(valid_proxies), len(other_ids)]
    s2 = [[k[0], k[1], v] for k, v in c.items()]
    return s2

res = [matchings(id) for id in list(df_id_filtered['id'])]
df_final_matching_counts = pd.DataFrame(list(itertools.chain(*res)))
Thanks!