I am looking at a bigger code problem and trying to break down simple parts so I can understand them. I am now trying to understand the pandas query function. I have reproduced a small example for my learning.
import pandas as pd
df = pd.DataFrame()
df['nameA'] = ['Donald','Daffy','Minnie']
df['nameB'] = ['Donald','Daffy','Minnie']
df2 = df.query('nameA < nameB')
print(df2)
I am getting an empty dataframe, though I've seen something exactly similar done in a larger codebase. Could someone explain what flaw I have in my fundamental understanding?
I would like to follow this up by grouping by the two columns and getting all combinations of names, but without duplicates.
I am trying to analyze an exam question I had a couple weeks ago. There were two dataframes, movies and cast.
The task was as follows:
Create a dataframe called good_teamwork that contains four columns:
cast_member_1 and cast_member_2, the names of each pair of cast members that appear in the same movie;
num_movies, the number of movies that each pair of cast members appears in; and
avg_score, the average review score for each of those movies containing the two cast members.
Order the results by cast_member_1 alphabetically from A-Z, and break any ties by sorting by cast_member_2 alphabetically from A-Z. Round the result for avg_score to two (2) decimal places.
Remove duplicates.
The movies dataframe is large, but somewhat as follows:
id name score
0 9 Star Wars: Episode III - Revenge of the Sith 3D 61
1 24214 The Chronicles of Narnia: The Lion, The Witch ... 46
2 1789 War of the Worlds 94
3 10009 Star Wars: Episode II - Attack of the Clones 3D 28
4 771238285 Warm Bodies 3
The cast dataframe follows the following format:
movie_id cast_id cast_name
0 9 162652153 Hayden Christensen
1 9 162652152 Ewan McGregor
2 9 418638213 Kenny Baker
3 9 548155708 Graeme Blundell
4 9 358317901 Jeremy Bulloch
The solution code is as follows:
joined_df = cast.merge(cast, how='inner', left_on='movie_id',
right_on='movie_id')
joined_df = joined_df.query('cast_name_x < cast_name_y')
good_teamwork2 = joined_df.merge(movies, how='inner',
left_on='movie_id', right_on='id')
good_teamwork2 = good_teamwork2.groupby(['cast_name_x',
'cast_name_y']).agg({'movie_id': 'size', 'score':
'mean'}).reset_index()
good_teamwork2.columns = ['cast_member_1', 'cast_member_2',
'avg_score', 'num_movies']
good_teamwork2 = good_teamwork2[good_teamwork2['avg_score'] >= 50]
good_teamwork2 = good_teamwork2[good_teamwork2['num_movies'] >= 3]
good_teamwork2 = good_teamwork2.round({'avg_score': 2})
good_teamwork2 = good_teamwork2.sort_values(by=['cast_member_1',
'cast_member_2'], ascending=[True, True]).reset_index(drop=True)
good_teamwork2 = good_teamwork2[['cast_member_1', 'cast_member_2',
'num_movies', 'avg_score']]
I am mostly trying to understand how the query statement and the groupby statement with cast_name_x and cast_name_y get all combinations of actors without any duplicates. I also don't see where, for instance, cast_name_x was declared as a variable for use.