
I have a dataframe like this,

>>> import pandas as pd
>>> data = {
...     'user_id': [1, 1, 1, 2, 2, 3, 3, 4, 4, 4],
...     'movie_id': [0, 1, 2, 0, 1, 2, 3, 2, 3, 4]
... }
>>> df = pd.DataFrame(data)
>>> df
   user_id  movie_id
0        1         0
1        1         1
2        1         2
3        2         0
4        2         1
5        3         2
6        3         3
7        4         2
8        4         3
9        4         4

I want to know how many people liked the second movie after they liked the first movie, how many liked the third movie after they liked the second movie, and so on. Here is my expected output:

[[0., 2., 0., 0., 0.],
 [0., 0., 1., 0., 0.],
 [0., 0., 0., 2., 0.],
 [0., 0., 0., 0., 1.],
 [0., 0., 0., 0., 0.]]

For instance, movie_id=1 was liked two times after movie_id=0, so matrix[0][1]=2. How did I find this result? user_id=1 liked movie_id=0, movie_id=1 and movie_id=2, in that order. Also, user_id=2 liked movie_id=0 and then movie_id=1. So, matrix[0][1]=2.
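(For reference, here is a minimal brute-force sketch of this counting on the sample data, tallying each user's consecutive movie pairs, which reproduces the expected matrix above; the variable names are illustrative only.)

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'user_id': [1, 1, 1, 2, 2, 3, 3, 4, 4, 4],
    'movie_id': [0, 1, 2, 0, 1, 2, 3, 2, 3, 4]
})

n = df['movie_id'].nunique()
matrix = np.zeros((n, n))

# for each user, count every consecutive (previous, next) movie pair
for _, movies in df.groupby('user_id')['movie_id']:
    seq = movies.tolist()
    for a, b in zip(seq, seq[1:]):
        matrix[a][b] += 1
```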

I tried the following, but it returns incorrect output and is very slow on a big dataframe.

import numpy as np

item = dict()

def cross(a):
    for i in a:
        for j in a:
            if i == j:
                continue
            item[(i, j)] = item.get((i, j), 0) + 1

_ = df.groupby('user_id')['movie_id'].apply(cross)

length = df['movie_id'].nunique()
res = np.zeros([length, length])

for k, v in item.items():
    res[k] = v

Any idea? Thanks in advance.

E. Zeytinci
  • Have you read the Pandas docs? – AMC Jan 07 '20 at 21:32
  • Yeap, but I couldn't find what I was looking for. Do you know any part that can help? – E. Zeytinci Jan 07 '20 at 21:35
  • what do you mean by big dataframe? 10 different movies and millions of users, 1000 different movies and 1000 users? what is the maximum number of films for one user? this may change the way to solve your problem – Ben.T Jan 07 '20 at 21:53
  • Data contains 47,396 unique movies 162,414 unique users and 15,630,129 rows. And the maximum number of votes cast by a user is 12,802 and the average number of votes cast is 96. – E. Zeytinci Jan 07 '20 at 21:59
  • but `movie_id == 0` was **NOT** liked *AFTER* `movie_id == 1` so why is `matrix[1][0] == 2`? Or is the word *AFTER* not being used appropriately here? – piRSquared Jan 07 '20 at 22:06
  • You are absolutely right @piRSquared. I updated my question. So sorry. – E. Zeytinci Jan 07 '20 at 22:23

2 Answers


You can do the following:

import numpy as np

# add row numbers as a column
df.reset_index(inplace=True)

# merge df on itself
df2 = df.merge(df, how='inner', on='user_id')

# remove some entries, keep only pairs where movie_id_x was liked before movie_id_y
df2 = df2[df2['index_x']<df2['index_y']].drop(['index_x','index_y'], axis=1)

# use pivot table to make matrix
df3 = df2.pivot_table(index='movie_id_x',columns='movie_id_y', values='user_id', aggfunc='count')

# UPD: add empty rows for movies which were removed
ids = df['movie_id'].unique()
df3 = df3.reindex(ids)
df3 = df3.reindex(ids, axis=1)
df3 = df3.fillna(0)

# convert result from dataframe to array if necessary
res = np.array(df3)

Result:

print(res)
[[0 2 1 0 0]
 [0 0 1 0 0]
 [0 0 0 2 1]
 [0 0 0 0 1]
 [0 0 0 0 0]]
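(For easy testing, the same pipeline assembled into a self-contained snippet, run on the question's sample data.)

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'user_id': [1, 1, 1, 2, 2, 3, 3, 4, 4, 4],
    'movie_id': [0, 1, 2, 0, 1, 2, 3, 2, 3, 4]
})

# row numbers as a column, then self-merge to get all pairs per user
df = df.reset_index()
df2 = df.merge(df, how='inner', on='user_id')

# keep only pairs where movie_id_x was liked before movie_id_y
df2 = df2[df2['index_x'] < df2['index_y']].drop(['index_x', 'index_y'], axis=1)

# pivot to a matrix and restore any movies missing from rows/columns
df3 = df2.pivot_table(index='movie_id_x', columns='movie_id_y',
                      values='user_id', aggfunc='count')
ids = df['movie_id'].unique()
df3 = df3.reindex(ids).reindex(ids, axis=1).fillna(0)

res = np.array(df3)
```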

Faster version (with sparse matrix)

The idea is that your matrix is actually sparse, and storing it in dense form (especially as a pandas dataframe) takes a lot of memory, so it is reasonable to store it as a sparse matrix. The approach was found here.

from pandas.api.types import CategoricalDtype
from scipy.sparse import csr_matrix

# add row numbers as a column
df.reset_index(inplace=True)

# merge df on itself
df2 = df.merge(df, how='inner', on='user_id')

# remove some entries, keep only pairs where movie_id_x was liked before movie_id_y
df2 = df2[df2['index_x']<df2['index_y']].drop(['index_x','index_y'], axis=1)

# use groupby to count movie pairs
df2 = df2.groupby(['movie_id_x','movie_id_y'])['user_id'].count().reset_index()

# create pivot as sparse matrix
movies_t = CategoricalDtype(sorted(df['movie_id'].unique()), ordered=True)
row = df2['movie_id_x'].astype(movies_t).cat.codes
col = df2['movie_id_y'].astype(movies_t).cat.codes
sparse_matrix = csr_matrix((df2['user_id'], (row, col)),
                           shape=(movies_t.categories.size, movies_t.categories.size))

# convert sparse to dense if needed
res = sparse_matrix.todense()
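(A tiny standalone illustration of the coordinate-style `csr_matrix` constructor used above: duplicate (row, col) coordinates are summed together, which is what lets it act as a pivot/count replacement.)

```python
import numpy as np
from scipy.sparse import csr_matrix

rows = np.array([0, 0, 1])
cols = np.array([1, 1, 2])
vals = np.array([1, 1, 5])

# the two entries at (0, 1) are summed during construction
m = csr_matrix((vals, (rows, cols)), shape=(3, 3))
dense = m.toarray()
```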
Stepan

Update for the changed question (see the comments above):

Using merge and pd.crosstab with np.triu:

df_m = df.merge(df, on='user_id').query('movie_id_x <= movie_id_y')
arr = pd.crosstab(df_m['movie_id_x'], df_m['movie_id_y']).to_numpy()
arr = np.triu(arr, 1)
arr

Output:

array([[0, 2, 1, 0, 0],
       [0, 0, 1, 0, 0],
       [0, 0, 0, 2, 1],
       [0, 0, 0, 0, 1],
       [0, 0, 0, 0, 0]], dtype=int64)
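(`np.triu(arr, 1)` zeroes everything on and below the main diagonal; a quick illustration:)

```python
import numpy as np

a = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])

# k=1 keeps only entries strictly above the diagonal
upper = np.triu(a, 1)
```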

Using itertools and combinations:

from itertools import combinations

s = df.groupby('user_id')['movie_id'].apply(lambda x: pd.Series(combinations(x,2)))\
      .value_counts()

arr = pd.DataFrame(s, index=pd.MultiIndex.from_tuples(s.index))['movie_id'].unstack()\
        .reindex(index=[0,1,2,3,4], columns=[0,1,2,3,4]).fillna(0).to_numpy()

arr

Output:

array([[0., 2., 1., 0., 0.],
       [0., 0., 1., 0., 0.],
       [0., 0., 0., 2., 1.],
       [0., 0., 0., 0., 1.],
       [0., 0., 0., 0., 0.]])
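(For reference, `combinations(x, 2)` emits each pair once in positional order, so one user's like sequence turns into exactly the pairs being counted:)

```python
from itertools import combinations

likes = [0, 1, 2]  # user_id=1's movie sequence from the sample data
pairs = list(combinations(likes, 2))
```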

Update using @Stepan's approach, with merge and pd.crosstab (this counts pairs in both directions):

df_m = df.merge(df, on='user_id').query('movie_id_x != movie_id_y')
pd.crosstab(df_m['movie_id_x'], df_m['movie_id_y']).to_numpy()

Output:

array([[0, 2, 1, 0, 0],
       [2, 0, 1, 0, 0],
       [1, 1, 0, 2, 1],
       [0, 0, 2, 0, 1],
       [0, 0, 1, 1, 0]], dtype=int64)

You can try this:

from itertools import permutations

s = df.groupby('user_id')['movie_id'].apply(lambda x: pd.Series(permutations(x,2)))\
      .value_counts()

arr = pd.DataFrame(s, index=pd.MultiIndex.from_tuples(s.index))['movie_id'].unstack()\
        .reindex(index=[0,1,2,3,4], columns=[0,1,2,3,4]).fillna(0).to_numpy()

arr

Output:

array([[0., 2., 1., 0., 0.],
       [2., 0., 1., 0., 0.],
       [1., 1., 0., 2., 1.],
       [0., 0., 2., 0., 1.],
       [0., 0., 1., 1., 0.]])
Scott Boston