I'm trying to translate this Python (pandas) code to R:

# Sort by user overall rating first
reviews = reviews.sort_values('review_overall', ascending=False)

# Keep the highest rating from each user and drop the rest 
reviews = reviews.drop_duplicates(subset=['review_profilename', 'beer_name'], keep='first')

This is what I've written in R:

reviews_df <- df[order(-df$review_overall), ]
library(dplyr)
df_clean <- distinct(reviews_df, review_profilename, beer_name, .keep_all = TRUE)

The problem is that Python gives me 1496263 records, while R gives me 1496596 records.

link to dataset: dataset

Can someone help me find my mistake?

tucomax
  • Does this answer your question? [Finding ALL duplicate rows, including "elements with smaller subscripts"](https://stackoverflow.com/questions/7854433/finding-all-duplicate-rows-including-elements-with-smaller-subscripts) – Himanshu Pingulkar Oct 22 '21 at 11:26
  • Please share a minimal reproducible example: https://stackoverflow.com/help/minimal-reproducible-example – deschen Oct 22 '21 at 12:54
  • Thank you, I will try to do it soon. Your answer shows me that I can change my code. Thanks. – tucomax Oct 22 '21 at 14:02

1 Answer

Without sample data it's difficult to help, but you might be looking for:

library(tidyverse)
df_clean <- reviews_df %>%
  arrange(desc(review_overall)) %>%
  distinct(across(c(review_profilename, beer_name)), .keep_all = TRUE)

This code sorts the rows in descending order of review_overall and then, for every review_profilename + beer_name combination, keeps only the first row (i.e. the one with the highest review_overall).
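
For comparison with the pandas snippet in the question, here is a minimal sketch of the same sort-then-keep-first logic in plain Python. The sample rows and the helper name keep_highest_per_key are made up for illustration; only the column names come from the question.

```python
# Hypothetical sample rows: column names follow the question, values are invented.
reviews = [
    {"review_profilename": "alice", "beer_name": "Stout A", "review_overall": 3.5},
    {"review_profilename": "alice", "beer_name": "Stout A", "review_overall": 4.5},
    {"review_profilename": "bob", "beer_name": "Stout A", "review_overall": 4.0},
    {"review_profilename": "alice", "beer_name": "Lager B", "review_overall": 2.0},
]


def keep_highest_per_key(rows):
    """Sort by rating (descending), then keep the first row per (user, beer) pair."""
    # sorted() is stable, so rows with equal ratings keep their original order.
    ordered = sorted(rows, key=lambda r: r["review_overall"], reverse=True)
    seen = set()
    kept = []
    for row in ordered:
        key = (row["review_profilename"], row["beer_name"])
        if key not in seen:  # first occurrence is the highest-rated one
            seen.add(key)
            kept.append(row)
    return kept


deduped = keep_highest_per_key(reviews)
```

In this sketch the number of rows kept always equals the number of distinct (review_profilename, beer_name) pairs, whatever the sort order; sorting only decides which row survives. So a difference in row counts between two tools may point to differences in how the key columns were read in rather than to the sorting step, though that is hard to confirm without the data.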

deschen