I have a DataFrame like this:
>>> import pandas as pd
>>> data = {
...     'user_id': [1, 1, 1, 2, 2, 3, 3, 4, 4, 4],
...     'movie_id': [0, 1, 2, 0, 1, 2, 3, 2, 3, 4]
... }
>>> df = pd.DataFrame(data)
>>> df
   user_id  movie_id
0        1         0
1        1         1
2        1         2
3        2         0
4        2         1
5        3         2
6        3         3
7        4         2
8        4         3
9        4         4
I want to count how many users liked one movie immediately after another: for example, how many liked the second movie right after the first, how many liked the third right after the second, and so on. Here is my expected output:
[[0., 2., 0., 0., 0.],
[0., 0., 1., 0., 0.],
[0., 0., 0., 2., 0.],
[0., 0., 0., 0., 1.],
[0., 0., 0., 0., 0.]]
For instance, movie_id=1 was liked twice right after movie_id=0, so matrix[0][1]=2. How did I get this result? user_id=1 liked movie_id=0, movie_id=1 and movie_id=2, in that order. user_id=2 liked movie_id=0 and then movie_id=1. That makes two users who liked movie_id=1 directly after movie_id=0, so matrix[0][1]=2.
I tried the following, but it returns incorrect output and is very slow on a large DataFrame.
import numpy as np

item = dict()

def cross(a):
    for i in a:
        for j in a:
            if i == j:
                continue
            if (i, j) in item.keys():
                item[(i, j)] += 1
            else:
                item[(i, j)] = 1

_ = df.groupby('user_id')['movie_id'].apply(cross)

length = df['movie_id'].nunique()
res = np.zeros([length, length])
for k, v in item.items():
    res[k] = v
Any idea? Thanks in advance.
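
One possible way to count these consecutive transitions, as a rough sketch rather than a definitive answer: pair each row with the next movie liked by the same user using a grouped `shift(-1)`, then accumulate the pairs into the matrix with `np.add.at`. This assumes the rows of each user are already in the order the movies were liked, and that the `movie_id` values are the integers 0..n-1 so they can be used directly as matrix indices. The names `nxt`, `src`, `dst` and `n` are made up for this illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'user_id': [1, 1, 1, 2, 2, 3, 3, 4, 4, 4],
    'movie_id': [0, 1, 2, 0, 1, 2, 3, 2, 3, 4],
})

# Next movie liked by the same user; NaN on each user's last row.
# Assumption: rows are already ordered by the time each user liked the movie.
nxt = df.groupby('user_id')['movie_id'].shift(-1)

# Keep only the rows that have a successor within the same user.
mask = nxt.notna()
src = df.loc[mask, 'movie_id'].to_numpy()
dst = nxt[mask].astype(int).to_numpy()

# Assumption: movie_id values are 0..n-1, so they double as matrix indices.
n = int(df['movie_id'].max()) + 1
res = np.zeros((n, n))

# Unbuffered add, so repeated (src, dst) pairs are each counted.
np.add.at(res, (src, dst), 1)
print(res)
```

Because this avoids the nested Python loops over every pair of movies per user, it should scale better on a large DataFrame, though that has not been measured here. If the `movie_id` values are not contiguous integers, they would first need to be mapped to positions, for example with `pd.factorize`.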