How to replicate Python code in R to find duplicates?

Question

I'm trying to port this Python (pandas) code to R:

# Sort by user overall rating first
reviews = reviews.sort_values('review_overall', ascending=False)

# Keep the highest rating from each user and drop the rest 
reviews = reviews.drop_duplicates(subset= ['review_profilename','beer_name'], keep='first')

and I've written this code in R:

reviews_df <- df[order(-df$review_overall), ]

library(dplyr)
df_clean <- distinct(reviews_df, review_profilename, beer_name, .keep_all = TRUE)

The problem is that Python gives me 1496263 records, while R gives me 1496596 records.

link to dataset: dataset

Can someone help me find my mistake?

Does this answer your question? [Finding ALL duplicate rows, including "elements with smaller subscripts"](https://stackoverflow.com/questions/7854433/finding-all-duplicate-rows-including-elements-with-smaller-subscripts) — Himanshu Pingulkar, Oct 22 '21 at 11:26
Please share a minimal reproducible example: https://stackoverflow.com/help/minimal-reproducible-example — deschen, Oct 22 '21 at 12:54
Thank you, I will try to do it soon. Your answer shows me that I can change my code. Thanks. — tucomax, Oct 22 '21 at 14:02

score 1 · Accepted Answer · answered Oct 22 '21 at 12:56

Without having some data, it's difficult to help, but you might be looking for:

library(tidyverse)
df_clean <- reviews_df %>%
  arrange(desc(review_overall)) %>%
  distinct(across(c(review_profilename, beer_name)), .keep_all = TRUE)

This code sorts the rows by review_overall in descending order and then, for each review_profilename + beer_name combination, keeps only the first row (i.e. the one with the highest review_overall).
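To make the sort-then-keep-first logic concrete, here is a minimal sketch in plain Python (no pandas), using made-up toy rows rather than the real dataset. The function name keep_highest and the sample values are hypothetical and only for illustration. It also shows that the number of rows left after deduplication equals the number of distinct (review_profilename, beer_name) pairs, regardless of sort order. If that holds for your data, a different row count between Python and R would suggest that the key columns themselves differ after import (for example whitespace, encoding, or missing values), though I can't confirm that without seeing the data.

```python
# Toy rows: (review_profilename, beer_name, review_overall). Made-up data.
reviews = [
    ("alice", "Pale Ale", 3.5),
    ("alice", "Pale Ale", 4.5),
    ("bob", "Pale Ale", 4.0),
    ("alice", "Stout", 2.0),
    ("bob", "Pale Ale", 4.0),
]


def keep_highest(rows):
    """Sort by rating (descending), then keep the first row per (user, beer)."""
    # sorted() is stable, so rows with tied ratings keep their original order.
    ordered = sorted(rows, key=lambda r: r[2], reverse=True)
    seen = set()
    kept = []
    for user, beer, rating in ordered:
        key = (user, beer)
        if key not in seen:
            seen.add(key)
            kept.append((user, beer, rating))
    return kept


deduped = keep_highest(reviews)

# Distinct key pairs in the raw data; independent of how the rows are sorted.
unique_keys = {(user, beer) for user, beer, _ in reviews}

print(len(deduped), len(unique_keys))
```

Comparing the set of distinct key pairs produced on each side (Python vs. R) is one way to find which combinations appear in only one of the two results.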
