How to remove duplicated rows based on 3 columns for only one factor level?

Question

I have a list of 130 dataframes each with 27 columns and 2 factor levels per dataframe. I want to remove the duplicated rows in each dataframe based on 3 columns for one factor level only, keeping all rows in the other factor level and their duplicates.

I have sorted all the dataframes according to the factor levels and then I tried to remove the duplicated rows only for the first factor level.

The list is called x, and I index the dataframes in the list with x[[i]], with i running from 1 to 130.

The column in every dataframe called temp contains 2 factor levels, either 0 or 1. The 130 dataframes have been ordered according to level = 0 first and then level=1.

for (i in 1:130)
{
  # Convert the `temp` column (index position 24) to a factor with the 2 levels `0` and `1`
  x[[i]]$temp <- factor(x[[i]]$temp, levels = c(0, 1))

  # Order each dataframe by level; level = 0 first, then level = 1
  x[[i]] <- x[[i]][order(x[[i]]$temp), ]

  # This removes duplicates based on columns 2, 27 and 25 across all rows,
  # but I want to perform this only for the first factor level = 0
  x[[i]] <- x[[i]][!(duplicated(x[[i]][c(2, 27, 25)])), ]
}

Can you please provide a reproducible (simulated) example? Note that you can format your question to include the code chunks. — Roman Luštrik, Aug 28 '19 at 07:13
Please read https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example and edit your question! — jogo, Aug 28 '19 at 07:27

score 1 · Accepted Answer · answered Aug 28 '19 at 08:36

For a single data frame, say df, you can do the following:

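library(dplyr)
# Bare numbers inside distinct() are treated as constants, so select columns by position via across()
df %>% distinct(across(c(temp, 2, 27, 25)), .keep_all = TRUE)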

Note that you don't have to group on your factor: because temp is part of the key, rows from the two factor levels that share the same values in columns 2, 27 and 25 are still treated as distinct rows. Duplicates are therefore removed within each level separately.

The key here is the argument .keep_all, which keeps the remaining columns. Note, however, that if the remaining columns differ in some way, you still get only 1 row for each distinct combination of temp and columns 2, 27 and 25: distinct() keeps the first row it encounters for each combination and drops the rest.
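If duplicates should be dropped for level 0 only, as the question asks, while every row of level 1 (including its repeats) is kept, one possible base R sketch is shown below. The toy data frame and the column names a, b and c are made up for illustration; they stand in for columns 2, 27 and 25 of the real data.

```r
# Toy data: `a`, `b`, `c` are hypothetical stand-ins for columns 2, 27 and 25.
df <- data.frame(
  id   = 1:6,
  a    = c(1, 1, 2, 1, 1, 2),
  b    = c("x", "x", "y", "x", "x", "y"),
  c    = c(TRUE, TRUE, FALSE, TRUE, TRUE, FALSE),
  temp = factor(c(0, 0, 0, 1, 1, 1), levels = c(0, 1))
)

key_cols <- c("a", "b", "c")

# Including `temp` in the key means a row only counts as a repeat
# of an earlier row from the same factor level.
is_dup <- duplicated(df[c("temp", key_cols)])

# Drop a row only when it is a repeat AND belongs to level 0.
result <- df[!(is_dup & df$temp %in% "0"), ]
```

Here only the second row (a repeat within level 0) is dropped; the repeated row in level 1 stays. With the real data, key_cols could be replaced by names(df)[c(2, 27, 25)].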

To expand to a list of data.frames, you can use lapply:

lapply(x, function(df) {
  df %>% distinct(across(c(temp, 2, 27, 25)), .keep_all = TRUE)
}) %>% bind_rows(.id = 'date')

where the final call to bind_rows simply combines everything into a single data frame; the .id argument adds a column named date whose values are the entry names of your input list (or the list positions if the list is unnamed).
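The same list-wise pattern can be combined with removal for one level only. The following is a minimal base R sketch, not a definitive implementation: dedup_one_level is a hypothetical helper, and the two small data frames with key columns a and b are invented stand-ins for the 130 real ones.

```r
# Hypothetical helper: drop repeated keys only for rows whose `temp` equals `level`.
dedup_one_level <- function(df, key_cols, level = "0") {
  is_dup <- duplicated(df[c("temp", key_cols)])
  df[!(is_dup & df$temp %in% level), , drop = FALSE]
}

# Toy list standing in for the real list `x`; `a` and `b` are made-up key columns.
x <- list(
  first  = data.frame(a = c(1, 1, 1, 1), b = c("p", "p", "p", "p"),
                      temp = factor(c(0, 0, 1, 1), levels = c(0, 1))),
  second = data.frame(a = c(2, 3, 2), b = c("q", "q", "q"),
                      temp = factor(c(0, 0, 0), levels = c(0, 1)))
)

# Apply the helper to every data frame; the result is still a named list.
cleaned <- lapply(x, dedup_one_level, key_cols = c("a", "b"))

# Optional: stack into one data frame, recording the source list name in `date`.
combined <- do.call(
  rbind,
  Map(function(d, nm) cbind(date = nm, d), cleaned, names(cleaned))
)
```

Keeping the result as a list preserves the one-data-frame-per-entry structure of the input; the final step is only needed if a single combined data frame is wanted.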
