Match and Remove Rows Based on Condition R

Question

I've got an interesting one for you all.

First, I want to look through the ID column and identify duplicate values. Once those are identified, the code should compare the incomes of the rows that share an ID and keep the row with the largest income.

So if there are three ID values of 2, it will look for the one with the highest income and keep that row.

I know it's as easy as subsetting based on a condition, but I don't know how to remove rows depending on whether the income in one row is greater than in another (only when the IDs match).

I was thinking of using an ifelse statement to create a new column to identify duplicates (through subsetting or not) then use the new column's values to ifelse again to identify the larger income. From there I can just subset based on the new columns I have created.

Is there a faster, more efficient way of doing this?

The outcome should look like this.

Thank you

score 3 · Accepted Answer · answered Sep 07 '18 at 16:26

We can slice the rows by checking the highest value in 'Income' grouped by 'ID'

library(dplyr)
df1 %>%
  group_by(ID) %>%
  slice(which.max(Income))

Or using data.table

library(data.table)
setDT(df1)[, .SD[which.max(Income)], by = ID]

Or with base R

df1[with(df1, ave(Income, ID, FUN = max) == Income),]
#     ID Income
#1   1  98765
#4   2   5498
#5   5     23
#6   6     98
#8   7  67871
#9   9 983754
#13 10   4744
#14 11   6853

data

df1 <- structure(list(ID = c(1L, 2L, 2L, 2L, 5L, 6L, 7L, 7L, 9L, 10L, 
10L, 10L, 10L, 11L), Income = c(98765L, 3456L, 67L, 5498L, 23L, 
98L, 5645L, 67871L, 983754L, 982L, 2374L, 875L, 4744L, 6853L)), 
class = "data.frame", row.names = c(NA, 
-14L))
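
One difference between the approaches above shows up only when two rows of the same ID share the highest income: which.max() picks just the first of them, while the ave() comparison keeps every row equal to the group maximum. The sample data has no such tie, so here is a minimal base R sketch with a made-up data frame (`tied` and its values are hypothetical, not from the question):

```r
# Hypothetical data with a tie: ID 2 has two rows sharing the top income.
tied <- data.frame(ID = c(1L, 2L, 2L, 2L),
                   Income = c(10L, 50L, 50L, 7L))

# The ave() comparison keeps every row that equals its group's maximum.
all_max <- tied[with(tied, ave(Income, ID, FUN = max) == Income), ]

# which.max() returns only the first position of the maximum, so applying
# it per ID yields exactly one row per ID.
first_idx <- unlist(lapply(split(seq_len(nrow(tied)), tied$ID),
                           function(i) i[which.max(tied$Income[i])]))
first_max <- tied[first_idx, ]

nrow(all_max)    # 3: both tied rows for ID 2 are kept
nrow(first_max)  # 2: one row per ID
```

If exactly one row per ID is required, the which.max() variants are probably the safer choice.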

score 3 · Answer 2 · answered Sep 07 '18 at 16:31

order with duplicated (base R)

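# Sort by ID, then by Income in descending order within each ID
df <- df[order(df$ID, -df$Income), ]
# Keep only the first (highest-income) row for each ID
df[!duplicated(df$ID), ]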
   ID Income
1   1  98765
4   2   5498
5   5     23
6   6     98
8   7  67871
9   9 983754
13 10   4744
14 11   6853
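
Negating the column inside order() only works for numeric columns, and the code above overwrites df. As a possible variant, the sketch below passes a per-column `decreasing` vector (which needs method = "radix") and leaves the original data frame untouched. The names `ord`, `sorted` and `result` are just illustrative:

```r
# Sample data from the question.
df <- data.frame(
  ID = c(1L, 2L, 2L, 2L, 5L, 6L, 7L, 7L, 9L, 10L, 10L, 10L, 10L, 11L),
  Income = c(98765L, 3456L, 67L, 5498L, 23L, 98L, 5645L, 67871L,
             983754L, 982L, 2374L, 875L, 4744L, 6853L)
)

# Sort by ID ascending and Income descending; a per-column `decreasing`
# vector is only accepted with method = "radix".
ord <- order(df$ID, df$Income, decreasing = c(FALSE, TRUE), method = "radix")
sorted <- df[ord, ]

# The first row of each ID is now its highest income; drop the others.
result <- sorted[!duplicated(sorted$ID), ]
result
```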

BENY

score 3 · Answer 3 · answered Sep 09 '18 at 13:38

Here is another dplyr method. We can arrange the column and then slice the data frame for the first row.

library(dplyr)

df2 <- df %>%
  arrange(ID, desc(Income)) %>%
  group_by(ID) %>%
  slice(1) %>%
  ungroup()
df2
# # A tibble: 8 x 2
#      ID Income
#   <int>  <int>
# 1     1  98765
# 2     2   5498
# 3     5     23
# 4     6     98
# 5     7  67871
# 6     9 983754
# 7    10   4744
# 8    11   6853

DATA

df <- read.table(text = "ID Income
1   98765
2   3456
2   67
2   5498
5   23
6   98
7   5645
7   67871
9   983754
10  982
10  2374
10  875
10  4744
11  6853",
                 header = TRUE)
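
Assuming dplyr 1.0.0 or later is installed, the arrange-then-slice steps can probably be collapsed into a single slice_max() call. This is a sketch of that alternative, not part of the original answer; `df3` is an illustrative name:

```r
library(dplyr)

# Sample data from the question.
df <- data.frame(
  ID = c(1L, 2L, 2L, 2L, 5L, 6L, 7L, 7L, 9L, 10L, 10L, 10L, 10L, 11L),
  Income = c(98765L, 3456L, 67L, 5498L, 23L, 98L, 5645L, 67871L,
             983754L, 982L, 2374L, 875L, 4744L, 6853L)
)

# Keep the single highest-income row per ID; with_ties = FALSE guarantees
# one row per group even if two rows share the maximum.
df3 <- df %>%
  group_by(ID) %>%
  slice_max(Income, n = 1, with_ties = FALSE) %>%
  ungroup()
df3
```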

score 2 · Answer 4 · answered Sep 07 '18 at 16:44

group_by() and summarise() from dplyr would work too, as long as only the ID and the maximum Income are needed.

library(dplyr)
df1 %>%
  group_by(ID) %>%
  summarise(Income = max(Income))

     ID  Income
  <int>   <dbl>
1     1  98765.
2     2   5498.
3     5     23.
4     6     98.
5     7  67871.
6     9 983754.
7    10   4744.
8    11   6853.
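
One thing to watch with summarise(): it returns only the grouping column and the summaries, so any other columns in the row are dropped. A small sketch with a made-up extra column (`people` and `Name` are hypothetical, not in the question's data) shows the difference from slice():

```r
library(dplyr)

# Hypothetical data with an extra column that should travel with the row.
people <- data.frame(ID = c(1L, 2L, 2L),
                     Income = c(100L, 300L, 200L),
                     Name = c("a", "b", "c"),
                     stringsAsFactors = FALSE)

# summarise() keeps only ID and the computed maximum.
summarised <- people %>%
  group_by(ID) %>%
  summarise(Income = max(Income))

# slice() keeps the whole row that holds the maximum.
sliced <- people %>%
  group_by(ID) %>%
  slice(which.max(Income)) %>%
  ungroup()

names(summarised)  # "ID" "Income"
names(sliced)      # "ID" "Income" "Name"
```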

score 2 · Answer 5 · answered Sep 07 '18 at 16:45

Using sqldf: Group by ID and select the corresponding max Income

library(sqldf)
sqldf("select ID,max(Income) from df group by ID")

Output:

  ID max(Income)
1  1       98765
2  2        5498
3  5          23
4  6          98
5  7       67871
6  9      983754
7 10        4744
8 11        6853
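
The result column above is literally named max(Income), which is awkward to refer to later. A small variation, assuming the sqldf package is installed, is to alias the aggregate and sort explicitly (`res` is an illustrative name):

```r
library(sqldf)

# Sample data from the question.
df <- data.frame(
  ID = c(1L, 2L, 2L, 2L, 5L, 6L, 7L, 7L, 9L, 10L, 10L, 10L, 10L, 11L),
  Income = c(98765L, 3456L, 67L, 5498L, 23L, 98L, 5645L, 67871L,
             983754L, 982L, 2374L, 875L, 4744L, 6853L)
)

# Alias the aggregate so the result keeps the original column name,
# and order by ID so the row order does not depend on the SQL engine.
res <- sqldf("select ID, max(Income) as Income from df group by ID order by ID")
names(res)  # "ID" "Income"
```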
