Here's a sample of the database I'm working with :
{'type': {0: 'TV Show', 1: 'Movie', 2: 'Movie', 3: 'Movie', 4: 'Movie'},
'director': {0: nan,
1: 'Jorge Michel Grau',
2: 'Gilbert Chan',
3: 'Shane Acker',
4: 'Robert Luketic'},
'country': {0: 'Brazil',
1: 'Mexico',
2: 'Singapore',
3: 'Poland, United States',
4: 'Norway, Poland, United States'},
'rating': {0: 'TV-MA', 1: 'TV-MA', 2: 'R', 3: 'PG-13', 4: 'PG-13'}}
What I'm trying to do is count how many times two countries (even more than two if possible) collaborated on a movies/tv show. Using the provided sample, I'd find that the USA and Poland collaborated twice, USA and Norway once, and no collaborations for the rest of the countries. Here's the code that I managed to make :
# This function would count the number of occurrences of each country in the column
def count(data, column) :
return Counter([thing.strip() for thing in data[column].fillna('missing') for thing in thing.split(',')])
# And this one would count the occurrences of couples of countries together
def count_tuple(data, column) :
a, b = zip(*count(data, column).most_common())
s = pd.DataFrame(columns=a, index=a)
for l in a :
mask = data[column].fillna('missing').apply(lambda z: l in z)
df = data[mask]
c, d = zip(*count(df, column).most_common())
for k in c :
if k !='' :
occur = count(df, column)[k]
s.loc[l,k] = occur
return s.fillna(0)
This function would return a dataframe
with the occurrence of each couple of countries. As usual I don't think this method is efficient, Could there be another way to do this ?