Find common column values based on another column

Question

I have a data frame which contains two columns UserId and movieId. Different users have watched different movies. I want to fetch (e.g. three common movies between two common users).

df = DataFrame({'userId' : [1,2,3,1,3,6,2,4,1,2], 'movieId' : [222,222,900,555,555,888,555,222,666,666]})

The required result should be like this

userId movieId
1       222
1       555
1       666
2       222
2       555
2       666

i do not need other entries which does not contain three common movies for users. For example, if there is another user who watched all three movies should be considered.

What exactly will be the arguments of the request ? Will you provide the number of movies to match ? A minimum number of movies to match ? I would use a dict with userId as key and a set of movieId as value. You then need to consider pairs of users and compute the intersection of their movies. If the size of the result is above the threshold (?), you found a result. — chmike, Nov 30 '19 at 14:04

jezrael · Answer 1 · 2019-12-01T10:20:57.667

2

Create indicators by get_dummies, then get 1 values per users by max and filter rows with N common values by boolean indexing with sum and DataFrame.all, last reshape by DataFrame.stack and convert MultiIndex to columns by Index.to_frame:

df1 = pd.get_dummies(df.set_index('userId')['movieId']).max(level=0)
df1 = df1[df1.sum(axis=1).ge(N)]
df1 = (df1.loc[:, df1.eq(1).all()]
          .iloc[:, :N]
          .rename_axis('movieId', axis=1))

df1 = pd.DataFrame({'userId': np.repeat(df1.index, len(df1.columns)),
                    'movieId': np.tile(df1.columns, len(df1.index))})

For test only 2 users is possible use another solution with pivot_table and dropna:

N = 3
df = df[df['userId'].isin([1,2])]
df1 = (df.pivot_table(index='userId',columns='movieId', aggfunc=len)
         .dropna(axis=1)
         .iloc[:, :N])
print (df1)
movieId  222  555  666
userId                
1          1    1    1
2          1    1    1

df1 = pd.DataFrame({'userId': np.repeat(df1.index, len(df1.columns)),
                    'movieId': np.tile(df1.columns, len(df1.index))})
print (df1)
   userId  movieId
0       1      222
1       1      555
2       1      666
3       2      222
4       2      555
5       2      666

edited Dec 01 '19 at 10:20

answered Nov 30 '19 at 14:01

jezrael

822,522
95
1,334
1,252

I have other columns corresponding to both of these columns. How can i fetch the data of those columns. – Sanwal Nov 30 '19 at 17:49
@muhammadsanwal - then use `df1 = df.drop_duplicates(['userId','movieId']).merge(df1)` – jezrael Nov 30 '19 at 17:51
i mean, i have other columns (e.g. ratings, timestamp). I want those columns values with df1 after all the above process. – Sanwal Nov 30 '19 at 18:02
@muhammadsanwal - yes, if remove duplicated in original data by `df.drop_duplicates(['userId','movieId'])` and merge `df1` from my solution get all columns, if need also duplicated rows use `df1 = df.merge(df1)` – jezrael Nov 30 '19 at 18:04
i am fetching (e.g for two random users 5 common movies). If `(length of df1 != length_of_movies * length_of_users)` it should again find other random users. for some iterations it works fine then i get this error `IndexError: list index out of range`. After tracing back it seems like the error is in this line `.rename_axis('movieId', axis=1)`. Can you help? – Sanwal Nov 30 '19 at 20:04
@muhammadsanwal - Can you add other users to sample data for debug error? – jezrael Dec 01 '19 at 05:37
Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/203412/discussion-between-muhammad-sanwal-and-jezrael). – Sanwal Dec 01 '19 at 06:19
@muhammadsanwal - No easy find proble, but doen it ;) Problem was there was returned empty DataFrame, so `stack` function failed. So created final DataFrame another way, now working nice. – jezrael Dec 01 '19 at 09:53
@muhammadsanwal - And accepted answer is wrong, because checked only length of values, not common. test it by `df = pd.DataFrame({'userId' : [1,2,3,1,3,6,2,4,1,2], 'movieId' : [333,222,900,555,555,888,555,222,666,777]})` – jezrael Dec 01 '19 at 10:22
i tested updated version of your code it's giving the correct values but in movieId column it gives like that ```userId movieId 0 256 (rating, 163) 1 256 (rating, 260) 2 256 (rating, 318) 3 256 (rating, 370) 4 256 (rating, 377) 5 256 (rating, 520) 6 256 (rating, 593)``` – Sanwal Dec 01 '19 at 16:05
@muhammadsanwal - hmm, what is correct output from comment above? – jezrael Dec 01 '19 at 16:07
The answer which i accepted is applicable. it is giving the count of 2 if movie is watched by two users. i can fetch all those movies with count 2 (in other means those movies are common in both of users). What do you think? – Sanwal Dec 01 '19 at 16:11
(rating, 260) 260 is actually movieId but why it is showing like this (rating, 260)? – Sanwal Dec 01 '19 at 16:13
@muhammadsanwal - interesting, it seems some forgotten variable, not idea. – jezrael Dec 01 '19 at 16:14

score 1 · Accepted Answer · answered Nov 30 '19 at 15:00

Here's one that does some dataframe manipulation.

Set some variables:

n_common_movies = 3 n_users = 2
Create a column with movie groups:

df1 = df.groupby('userId')['movieId'].apply(list).reset_index(name='movies')

Output:

df1

   userId           movies
0       1  [222, 555, 666]
1       2  [222, 555, 666]
2       3       [900, 555]
3       4            [222]
4       6            [888]

Reduce that list to the number of movies that equal n_common_movies since that's the total number we want. If that isn't met, then we can skip the rest of the process.

df2 = df1.loc[df1['movies'].apply(lambda x: len(x))== n_common_movies,:]

Output:

df2

   userId           movies
0       1  [222, 555, 666]
1       2  [222, 555, 666]

Use pd.explode() to "stack" the results of step 2:

df3 = df2.explode('movies')

Output:

df3

   userId movies
0       1    222
0       1    555
0       1    666
1       2    222
1       2    555
1       2    666

Create another grouping to get a view count for each movie:

df4 = df3.groupby('movies').size().reset_index(name='viewer_count')

Output:

df4

   movies  viewer_count
0     222             2
1     555             2
2     666             2

Finally, check to see that a filtered result based on expected number of users is equal to the length of the number of common movies and print the...userId, I guess. Print whatever you want, lol.

if len(df4[df4['viewer_count'] == n_users]) == n_common: tmp = '\n\t'.join([str(i) for i in list(set(df3['userId']))]) print('Users with three common movies: \n\t{}'.format(tmp))

Output:

Users with three common movies: 1 2

Dani Mesejo · Answer 3 · 2019-11-30T14:31:18.693

I think is best if you defined a function to fetch k common movies between two users u and v, for example:

def common_movies(d, u, v, k=3):
    """Fetch common movies between users u and v"""

    # create filter by the specified users
    mask = d['userId'].isin((u, v))

    # group by movieId, aggregate into a list and then explode on userId
    values = d[mask].groupby('movieId').agg({'userId': list}).explode('userId')

    # filter by the first k movies
    return values.loc[values.index.unique()[:k]].sort_values('userId').reset_index()


print(common_movies(df, 1, 2))

Output

   movieId userId
0      222      1
1      555      1
2      666      1
3      222      2
4      555      2
5      666      2

Notice that the in the above function the default value is 3, as specified, also this function is robust in the sense that it will not fail if there aren't the number of specified values, for example if you remove movie 222 it returns:

   movieId userId
0      555      1
1      666      1
2      555      2
3      666      2

AttributeError: 'DataFrame' object has no attribute 'explode. I getting this error. — Sanwal, Nov 30 '19 at 15:38
What version of pandas are you using? See [this](https://stackoverflow.com/questions/53218931/how-to-unnest-explode-a-column-in-a-pandas-dataframe) to explode a DataFrame before pandas 0.25 — Dani Mesejo, Nov 30 '19 at 15:40

Find common column values based on another column

3 Answers3