Pandas : How to use .agg()

Question

I have a data frame containing the customers ratings of the restaurants they went to and few other attributes.

What i want to do is to calculate the difference between the average star rating for the last year and the average star rating for the first year of a restaurant.


data = {'rating_id': ['1', '2','3','4','5','6','7'],
        'user_id': ['56', '13','56','99','99','13','12'],
        'restaurant_id':  ['xxx', 'xxx','yyy','yyy','xxx','zzz','zzz'],
        'star_rating': ['2.3', '3.7','1.2','5.0','1.0','3.2','1.0'],
        'rating_year': ['2012','2012','2020','2001','2020','2015','2000'],
        'first_year': ['2012', '2012','2001','2001','2012','2000','2000'],
        'last_year': ['2020', '2020','2020','2020','2020','2015','2015'],
        }

df = pd.DataFrame (data, columns = ['rating_id','user_id','restaurant_id','star_rating','rating_year','first_year','last_year'])

df.head()

df['star_rating'] = df['star_rating'].astype(float)

# calculate the average of the stars of the first year 

ratings_mean_firstYear= df.groupby(['restaurant_id','first_year']).agg({'star_rating':[np.mean]})
ratings_mean_firstYear.columns = ['avg_firstYear']
ratings_mean_firstYear.reset_index()

# calculate the average of the stars of the last year 

ratings_mean_lastYear= df.groupby(['restaurant_id','last_year']).agg({'star_rating':[np.mean]})
ratings_mean_lastYear.columns = ['avg_lastYear']
ratings_mean_lastYear.reset_index()

# merge the means into a single table

ratings_average = ratings_mean_firstYear.merge(
    ratings_mean_lastYear.groupby('restaurant_id')['avg_lastYear'].max()
    , on='restaurant_id'
)

ratings_average.head(20)

My problem is that the averages of the first and last years are the exact same and that makes no sens, i don't really know what i did wrong in my thought process here..i suspect something is going on with .agg since it's the first time i use pandas lib.

Any suggestions?

Does this answer your question? [Aggregation in pandas](https://stackoverflow.com/questions/53781634/aggregation-in-pandas) — woblob, Sep 24 '20 at 19:45
There is a logical error : you apply your aggregation on the same column stars. You need to have stars for the first and the last year, but apply the aggregation on the same stars. So the result is just fine. Create columns with `stars first` and `stars last` and it will work. — Oliver Prislan, Sep 24 '20 at 20:05
Your data is provided in such a way that it has single rating per user/restaurant pair and you use it in both first and last year aggregation - so naturally it is equal for both years. I'd first filter the data using `rating_year == first_year` criteria and then apply groupby and agg. Then repeat same for the last year and then merge 2 results. In your example there is not a single review, whose data matches first or last year of any restaurant. So to show proper example would require more data. I assume that you have it in your larger dataframe. — Poe Dator, Sep 24 '20 at 20:10

Poe Dator · Accepted Answer · 2020-09-24T20:44:09.423

Your data is provided in such a way that it has single rating per user/restaurant pair and you use it in both first and last year aggregation - so naturally it is equal for both years. I'd first filter the data using rating_year == first_year criteria and then apply groupby and agg. Then repeat same for the last year and then merge 2 results. In your example there is not a single review, whose data matches first or last year of any restaurant. So to show proper example would require more data. I assume that you have it in your larger dataframe. –

Here is an example, I added more lines and changed years to have more matches:

data = {'rating_id': ['1', '2','3','4','5','6','7','8','9'],
        'user_id': ['56', '56','56','56', '99','99','99','99','99'],
        'restaurant_id':  ['xxx', 'xxx','yyy','yyy','xxx', 'xxx','yyy','yyy','xxx'],
        'star_rating': ['2.3', '3.7','1.2','5.0','1.0','3.2','4.0','2.5','3.0'],
        'rating_year': ['2012', '2020','2001','2020', '2012', '2020','2001','2020','2019'],
        'first_year': ['2012', '2012','2001','2001','2012', '2012','2001','2001','2012'],
        'last_year': ['2020', '2020','2020','2020','2020','2020','2020','2020','2020'],
        }

df = pd.DataFrame (data, columns = ['rating_id','user_id','restaurant_id','star_rating','rating_year','first_year','last_year'])
df['star_rating'] = df['star_rating'].astype(float)

ratings_mean_firstYear = df[df.rating_year == df.first_year].groupby('restaurant_id').agg({'star_rating':'mean'})
ratings_mean_firstYear.columns = ['avg_firstYear']
ratings_mean_lastYear= df[df.rating_year == df.last_year].groupby('restaurant_id').agg({'star_rating':'mean'})
ratings_mean_lastYear.columns = ['avg_lastYear']

result:

ratings_mean_firstYear.merge(ratings_mean_lastYear, left_index=True, right_index=True)

               avg_firstYear  avg_lastYear
restaurant_id                             
xxx                     1.65          3.45
yyy                     2.60          3.75

ohh yeah thats true, well i can tweak the rating_year to match those in my example,sorry.. i think i understand my logical error — Lynn, Sep 24 '20 at 20:17
no problems - happens with everyone. Pls accept and upvote if done. — Poe Dator, Sep 24 '20 at 20:23
could you if possible add an example on how you'd filer the data with that criteria to apply the agg? i updated my data — Lynn, Sep 24 '20 at 20:24

Pandas : How to use .agg()

1 Answers1