0

I have a DataFrame like this for 70581 rows

    id    created_at ... resource_id
230789    2017-01-19 ...         490
230722    2017-01-19 ...         514
   ...           ... ...         ...
312341    2017-08-27 ...         551

I want to get all possible pairs of resource_id column. If pair is repeated I want to increment the counter of a pair by 1. Result may be something like: (490,514) count 5.

I've tried to use list(itertools.combinations(df['resource_id'],2)) to get the pairs, but instead got MemoryError. How can I get what I want?

Asekeeewka
  • 35
  • 4
  • 1
    Take a look here: https://stackoverflow.com/questions/19384532/get-statistics-for-each-group-such-as-count-mean-etc-using-pandas-groupby – Lev Gelman Jan 10 '21 at 09:41
  • [Lev Gelman](https://stackoverflow.com/users/9080237/lev-gelman) That was helpful, but first of all I need to get those pairs to put them into separate columns to achieve what's written in that question. Thanks for help! – Asekeeewka Jan 10 '21 at 09:50
  • You cannot use an iterator for counts elements. The number of elements is hidden before you iterate them. A useful approach is grouping the rows by resource id and count the instances, then combine them, and multiply the counters. – Yanirmr Jan 10 '21 at 09:52
  • why use a list() around the generator? simply do not create the list in full – Patrick Artner Jan 10 '21 at 09:53
  • @Yanirmr My approach was first getting all the pairs with repititions, and then count every occurance putting it in dictionary, but I ran out of Memory. I'll try grouping – Asekeeewka Jan 10 '21 at 09:57
  • @Asekeeewka this is a very inefficient way. this is the reason for the memory issue you have, – Yanirmr Jan 10 '21 at 10:08

1 Answers1

0

You cannot use an iterator for counts elements. The number of elements is hidden before you iterate them.

In addition, the preferred approach in terms of efficiency is to first count each of the elements and then find the pairs. This saves you a lot of time because the pair share is in total multiplying the number of times each of the elements appears.

You can try something like that:

from collections import Counter
import itertools
MyList = [1,1,1,2,3,4,4] # you can insert here your df.series as list
a = dict(Counter(MyList))

combinations = itertools.combinations(list(a.keys()),2)
for i in combinations:
    print (i,a[i[0]]*a[i[1]])

The output is:

 (1, 2) 3
 (1, 3) 3
 (1, 4) 6
 (2, 3) 1
 (2, 4) 2
 (3, 4) 2
Yanirmr
  • 923
  • 8
  • 25