0

I am looking at a bigger code problem and trying to break down simple parts so I can understand them. I am now trying to understand the pandas query function. I have reproduced a small example for my learning.

import pandas as pd

df = pd.DataFrame()

df['nameA'] = ['Donald','Daffy','Minnie']

df['nameB'] = ['Donald','Daffy','Minnie']

df2 = df.query('nameA < nameB')

print(df2)

I am getting an empty dataframe, though I've seen something exactly similar done in a larger codebase. Could someone explain what flaw I have in my fundamental understanding?

I would like to follow this up by grouping by the two columns and getting all combinations of names, but without duplicates.

I am trying to analyze an exam question I had a couple weeks ago. There were two dataframes, movies and cast.

The task was as follows:

Create a dataframe called good_teamwork that contains four columns:

cast_member_1 and cast_member_2, the names of each pair of cast members that appear in the same movie;
num_movies, the number of movies that each pair of cast members appears in; and
avg_score, the average review score for each of those movies containing the two cast members.

Order the results by cast_member_1 alphabetically from A-Z, and break any ties by sorting by cast_member_2 alphabetically from A-Z. Round the result for avg_score to two (2) decimal places.

Remove duplicates.

The movies dataframe is large, but somewhat as follows:

id   name                                                score
0   9   Star Wars: Episode III - Revenge of the Sith 3D         61
1   24214   The Chronicles of Narnia: The Lion, The Witch ...   46
2   1789    War of the Worlds   94
3   10009   Star Wars: Episode II - Attack of the Clones 3D     28
4   771238285   Warm Bodies                                      3

The cast dataframe follows the following format:

movie_id  cast_id   cast_name
0   9   162652153   Hayden Christensen
1   9   162652152   Ewan McGregor
2   9   418638213   Kenny Baker
3   9   548155708   Graeme Blundell
4   9   358317901   Jeremy Bulloch

The solution code is as follows:

joined_df = cast.merge(cast, how='inner', left_on='movie_id', 
right_on='movie_id')
joined_df = joined_df.query('cast_name_x < cast_name_y')
good_teamwork2 = joined_df.merge(movies, how='inner', 
left_on='movie_id', right_on='id')
good_teamwork2 = good_teamwork2.groupby(['cast_name_x', 
'cast_name_y']).agg({'movie_id': 'size', 'score': 
'mean'}).reset_index()
good_teamwork2.columns = ['cast_member_1', 'cast_member_2', 
'avg_score', 'num_movies']
good_teamwork2 = good_teamwork2[good_teamwork2['avg_score'] >= 50]
good_teamwork2 = good_teamwork2[good_teamwork2['num_movies'] >= 3]
good_teamwork2 = good_teamwork2.round({'avg_score': 2})
good_teamwork2 = good_teamwork2.sort_values(by=['cast_member_1', 
'cast_member_2'], ascending=[True, True]).reset_index(drop=True)
good_teamwork2 = good_teamwork2[['cast_member_1', 'cast_member_2', 
'num_movies', 'avg_score']]

I am mostly trying to understand how the query statement and the groupby statement with cast_name_x and cast_name_y get all combinations of actors without any duplicates. I also don't see where, for instance, cast_name_x was declared as a variable for use.

wayneeusa
  • 194
  • 1
  • 10
  • 1
    What is expected final output? What is reason for compare strings columns by `<` ? – jezrael Nov 18 '18 at 11:35
  • So my first idea is use outer join first with remove same values in both columns by `joined_df = cast.merge(cast, how='outer', left_on='movie_id', right_on='movie_id')` and `joined_df = joined_df.query('cast_name_x != cast_name_y')`, but not 100% sure. Is possible create [minimal, complete, and verifiable example](http://stackoverflow.com/help/mcve), maybe also help [how to provide a great pandas example](http://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples). – jezrael Nov 18 '18 at 12:21
  • Sure. But also the posted solution works. It's largely a matter of understanding it. For instance, why it uses cast_name_x < cast_name_y – wayneeusa Nov 18 '18 at 12:34
  • hmmm, hard question. Because if check link in question it compare each row of `cast_name_x` and `cast_name_y` lexicographically. And for strings it is rare, obviously are compared for equal `==` and not equal `!=`. I try create sample `cast = pd.DataFrame({'cast_name': ['Donald','Daffy','Minnie', 'Donald'], 'movie_id':[1,2,2,2]})` and check `joined_df = cast.merge(cast, how='inner', left_on='movie_id', right_on='movie_id') joined_df = joined_df.query('cast_name_x < cast_name_y')` but still not understand logic compare by `<`. – jezrael Nov 18 '18 at 12:52

1 Answers1

1

You can compare strings columns with less operator, but obviously it have no reason.

print(df)
    nameA   nameB
0  Donald  Donald
1   Daffy   Daffy
2  Minnie  Minnie

Alternative solution with same output is use boolean indexing with boolean mask - here is possible see comparison return only False values, so output is empty DataFrame:

mask = df['nameA'] < df['nameB']
print (mask)
0    False
1    False
2    False
dtype: bool

df2 = df[mask]
print (df2)
Empty DataFrame
Columns: [nameA, nameB]
Index: []

df2 = df.query('nameA < nameB')
print(df2)
Empty DataFrame
Columns: [nameA, nameB]
Index: []
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252