0

I have a data frame containing the customers ratings of the restaurants they went to and few other attributes.

  • What i want to do is to calculate the difference between the average star rating for the last year and the average star rating for the first year of a restaurant.

data = {'rating_id': ['1', '2','3','4','5','6','7'],
        'user_id': ['56', '13','56','99','99','13','12'],
        'restaurant_id':  ['xxx', 'xxx','yyy','yyy','xxx','zzz','zzz'],
        'star_rating': ['2.3', '3.7','1.2','5.0','1.0','3.2','1.0'],
        'rating_year': ['2012','2012','2020','2001','2020','2015','2000'],
        'first_year': ['2012', '2012','2001','2001','2012','2000','2000'],
        'last_year': ['2020', '2020','2020','2020','2020','2015','2015'],
        }

df = pd.DataFrame (data, columns = ['rating_id','user_id','restaurant_id','star_rating','rating_year','first_year','last_year'])

df.head()

df['star_rating'] = df['star_rating'].astype(float)

# calculate the average of the stars of the first year 

ratings_mean_firstYear= df.groupby(['restaurant_id','first_year']).agg({'star_rating':[np.mean]})
ratings_mean_firstYear.columns = ['avg_firstYear']
ratings_mean_firstYear.reset_index()

# calculate the average of the stars of the last year 

ratings_mean_lastYear= df.groupby(['restaurant_id','last_year']).agg({'star_rating':[np.mean]})
ratings_mean_lastYear.columns = ['avg_lastYear']
ratings_mean_lastYear.reset_index()

# merge the means into a single table

ratings_average = ratings_mean_firstYear.merge(
    ratings_mean_lastYear.groupby('restaurant_id')['avg_lastYear'].max()
    , on='restaurant_id'
)

ratings_average.head(20)

enter image description here

My problem is that the averages of the first and last years are the exact same and that makes no sens, i don't really know what i did wrong in my thought process here..i suspect something is going on with .agg since it's the first time i use pandas lib.

Any suggestions?

Lynn
  • 121
  • 8
  • 25
  • Does this answer your question? [Aggregation in pandas](https://stackoverflow.com/questions/53781634/aggregation-in-pandas) – woblob Sep 24 '20 at 19:45
  • not really tbh. – Lynn Sep 24 '20 at 19:50
  • 1
    There is a logical error : you apply your aggregation on the same column stars. You need to have stars for the first and the last year, but apply the aggregation on the same stars. So the result is just fine. Create columns with `stars first` and `stars last` and it will work. – Oliver Prislan Sep 24 '20 at 20:05
  • @OliverPrislan ohh i will try that! – Lynn Sep 24 '20 at 20:07
  • Your data is provided in such a way that it has single rating per user/restaurant pair and you use it in both first and last year aggregation - so naturally it is equal for both years. I'd first filter the data using `rating_year == first_year` criteria and then apply groupby and agg. Then repeat same for the last year and then merge 2 results. In your example there is not a single review, whose data matches first or last year of any restaurant. So to show proper example would require more data. I assume that you have it in your larger dataframe. – Poe Dator Sep 24 '20 at 20:10

1 Answers1

1

Your data is provided in such a way that it has single rating per user/restaurant pair and you use it in both first and last year aggregation - so naturally it is equal for both years. I'd first filter the data using rating_year == first_year criteria and then apply groupby and agg. Then repeat same for the last year and then merge 2 results. In your example there is not a single review, whose data matches first or last year of any restaurant. So to show proper example would require more data. I assume that you have it in your larger dataframe. –

Here is an example, I added more lines and changed years to have more matches:

data = {'rating_id': ['1', '2','3','4','5','6','7','8','9'],
        'user_id': ['56', '56','56','56', '99','99','99','99','99'],
        'restaurant_id':  ['xxx', 'xxx','yyy','yyy','xxx', 'xxx','yyy','yyy','xxx'],
        'star_rating': ['2.3', '3.7','1.2','5.0','1.0','3.2','4.0','2.5','3.0'],
        'rating_year': ['2012', '2020','2001','2020', '2012', '2020','2001','2020','2019'],
        'first_year': ['2012', '2012','2001','2001','2012', '2012','2001','2001','2012'],
        'last_year': ['2020', '2020','2020','2020','2020','2020','2020','2020','2020'],
        }

df = pd.DataFrame (data, columns = ['rating_id','user_id','restaurant_id','star_rating','rating_year','first_year','last_year'])
df['star_rating'] = df['star_rating'].astype(float)

ratings_mean_firstYear = df[df.rating_year == df.first_year].groupby('restaurant_id').agg({'star_rating':'mean'})
ratings_mean_firstYear.columns = ['avg_firstYear']
ratings_mean_lastYear= df[df.rating_year == df.last_year].groupby('restaurant_id').agg({'star_rating':'mean'})
ratings_mean_lastYear.columns = ['avg_lastYear']

result:

ratings_mean_firstYear.merge(ratings_mean_lastYear, left_index=True, right_index=True)

               avg_firstYear  avg_lastYear
restaurant_id                             
xxx                     1.65          3.45
yyy                     2.60          3.75
Poe Dator
  • 4,535
  • 2
  • 14
  • 35