Replacing value in a row based on value in another row with same identifier using R

Question

I am trying to solve a problem in R that seems simple, but I can't work it out.

I have the following data:

tmp

  town_id city_ id flag
1   10500      111    1           
2   15300     1110    1           
3    9400    11100    1           
4    9400    11101    0           
5    9600    11102    0          
6    9800    11103    0

There is a duplicate town_id, and I would like to remove it while assigning the highest value in flag. That is, I would like to have:

  town_id city_ id flag
1   10500      111    1           
2   15300     1110    1           
3    9400    11100    1                     
4    9600    11102    0          
5    9800    11103    0

I tried to use the following dplyr code, but it assigns a 1 to everything:

tmp_2<-tmp %>% group_by(town_id) %>% mutate(flag=max(flag))

tmp_2

  town_id city_ id  flag 
1 10500   111          1           
2 15300   1110         1           
3 9400    11100        1           
4 9400    11101        1           
5 9600    11102        1           
6 9800    11103        1

Could someone please tell me what I am doing wrong?

Thank you.

Perhaps a case of [Why does summarize or mutate not work with group_by when I load `plyr` after `dplyr`?](https://stackoverflow.com/questions/26106146/why-does-summarize-or-mutate-not-work-with-group-by-when-i-load-plyr-after-dp) (and don't you want `summarize`?) — Henrik, May 22 '20 at 16:21
Could you specify more explicitly how you would want `city_id` to be treated? Should the lowest value be used? What if there are more than two duplicates with the same flag? — coffeinjunky, May 22 '20 at 16:58
Thank you. I did not think about this, but I would want the lowest value for city_id. — Cola, May 22 '20 at 17:28

score 0 · Answer 1 · answered May 22 '20 at 16:27

0

Its hacky, but I think it should work:

    tmp = arrange(tmp,-flag) 
    tmp = tmp[!(tmp$town_id %>% duplicated),]

answered May 22 '20 at 16:27

dvd280

876
6
11

user12728748 · Accepted Answer · 2020-05-22T17:39:43.063

I suppose this is what you want - filter out the duplicate:

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
tmp <- structure(list(town_id = c(10500L, 15300L, 9400L, 9400L, 9600L, 9800L), 
                      city_id = c(111L, 1110L, 11100L, 11101L, 11102L, 11103L), 
                      flag = c(1L, 1L, 1L, 0L, 0L, 0L)), 
                 class = "data.frame", row.names = c("1", "2", "3", "4", "5", "6"))
tmp_2 <- tmp %>% group_by(town_id) %>% dplyr::filter(flag==max(flag)) %>% ungroup()
tmp_2
#> # A tibble: 5 x 3
#>   town_id city_id  flag
#>     <int>   <int> <int>
#> 1   10500     111     1
#> 2   15300    1110     1
#> 3    9400   11100     1
#> 4    9600   11102     0
#> 5    9800   11103     0

Edit: In case there are more than one per group with the maximum value of flag and you only want to retain the first one, you could do:

tmp_2 <- tmp %>% group_by(town_id) %>%  
    dplyr::filter(flag==max(flag)) %>% 
    filter(row_number()==1) %>% ungroup()

^{Created on 2020-05-22 by the reprex package (v0.3.0)}

@coffeinjunky - You are correct. I amended my post to take that into account. — user12728748, May 22 '20 at 17:40

Replacing value in a row based on value in another row with same identifier using R

2 Answers2