0

Here's a sample of the database I'm working with :

{'type': {0: 'TV Show', 1: 'Movie', 2: 'Movie', 3: 'Movie', 4: 'Movie'},
 'director': {0: nan,
  1: 'Jorge Michel Grau',
  2: 'Gilbert Chan',
  3: 'Shane Acker',
  4: 'Robert Luketic'},
 'country': {0: 'Brazil',
  1: 'Mexico',
  2: 'Singapore',
  3: 'Poland, United States',
  4: 'Norway, Poland, United States'},
 'rating': {0: 'TV-MA', 1: 'TV-MA', 2: 'R', 3: 'PG-13', 4: 'PG-13'}}

What I'm trying to do is count how many times two countries (even more than two if possible) collaborated on a movies/tv show. Using the provided sample, I'd find that the USA and Poland collaborated twice, USA and Norway once, and no collaborations for the rest of the countries. Here's the code that I managed to make :

# This function would count the number of occurrences of each country in the column
def count(data, column) :
    return Counter([thing.strip() for thing in data[column].fillna('missing') for thing in thing.split(',')])

# And this one would count the occurrences of couples of countries together
def count_tuple(data, column) :
    a, b = zip(*count(data, column).most_common())
    s = pd.DataFrame(columns=a, index=a)
    
    for l in a :
        mask = data[column].fillna('missing').apply(lambda z: l in z)
        df = data[mask]
        c, d = zip(*count(df, column).most_common())
        for k in c :
            if k !='' :
                occur = count(df, column)[k]
                s.loc[l,k] = occur
    return s.fillna(0)

This function would return a dataframe with the occurrence of each couple of countries. As usual I don't think this method is efficient, Could there be another way to do this ?

1 Answers1

0

Here is solution for all combinations of country columns values in nested list comprehension with hashable sets called forozensets for count with not important order, it means (United States, Poland) is same like (Poland, United States) for count:

from itertools import chain, combinations
from collections import Counter

#https://stackoverflow.com/a/5898031/2901002
def all_subsets(ss):
    return list(chain(*map(lambda x: combinations(ss, x), range(2, len(ss)+1))))

L = [frozenset(y) for x in df['country'].fillna('missing') 
                  for y in all_subsets(x.split(', '))]
print (L)

out = Counter(L)

df = pd.DataFrame({'col1':out.keys(), 'col2': out.values()})
print (df)
                              col1  col2
0          (United States, Poland)     2
1                 (Norway, Poland)     1
2          (United States, Norway)     1
3  (United States, Norway, Poland)     1

Or you can sorting values to tuples:

from itertools import chain, combinations
from collections import Counter

def all_subsets(ss):
    return list(chain(*map(lambda x: combinations(ss, x), range(2, len(ss)+1))))

L = [tuple(sorted(y)) for x in df['country'].fillna('missing') 
                  for y in all_subsets(x.split(', '))]
print (L)

out = Counter(L)

df = pd.DataFrame({'col1':out.keys(), 'col2': out.values()})
print (df)
                              col1  col2
0          (Poland, United States)     2
1                 (Norway, Poland)     1
2          (Norway, United States)     1
3  (Norway, Poland, United States)     1
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252