0

I am trying to solve a problem in R that seems simple, but I can't work it out.

I have the following data:

tmp

  town_id city_ id flag
1   10500      111    1           
2   15300     1110    1           
3    9400    11100    1           
4    9400    11101    0           
5    9600    11102    0          
6    9800    11103    0           

There is a duplicate town_id, and I would like to remove it while assigning the highest value in flag. That is, I would like to have:

  town_id city_ id flag
1   10500      111    1           
2   15300     1110    1           
3    9400    11100    1                     
4    9600    11102    0          
5    9800    11103    0           

I tried to use the following dplyr code, but it assigns a 1 to everything:

tmp_2<-tmp %>% group_by(town_id) %>% mutate(flag=max(flag))

tmp_2

  town_id city_ id  flag 
1 10500   111          1           
2 15300   1110         1           
3 9400    11100        1           
4 9400    11101        1           
5 9600    11102        1           
6 9800    11103        1           

Could someone please tell me what I am doing wrong?

Thank you.

Cola
  • 41
  • 3
  • Perhaps a case of [Why does summarize or mutate not work with group_by when I load `plyr` after `dplyr`?](https://stackoverflow.com/questions/26106146/why-does-summarize-or-mutate-not-work-with-group-by-when-i-load-plyr-after-dp) (and don't you want `summarize`?) – Henrik May 22 '20 at 16:21
  • Could you specify more explicitly how you would want `city_id` to be treated? Should the lowest value be used? What if there are more than two duplicates with the same flag? – coffeinjunky May 22 '20 at 16:58
  • Thank you. I did not think about this, but I would want the lowest value for city_id. – Cola May 22 '20 at 17:28

2 Answers2

0

Its hacky, but I think it should work:

    tmp = arrange(tmp,-flag) 
    tmp = tmp[!(tmp$town_id %>% duplicated),]
dvd280
  • 876
  • 6
  • 11
0

I suppose this is what you want - filter out the duplicate:

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
tmp <- structure(list(town_id = c(10500L, 15300L, 9400L, 9400L, 9600L, 9800L), 
                      city_id = c(111L, 1110L, 11100L, 11101L, 11102L, 11103L), 
                      flag = c(1L, 1L, 1L, 0L, 0L, 0L)), 
                 class = "data.frame", row.names = c("1", "2", "3", "4", "5", "6"))
tmp_2 <- tmp %>% group_by(town_id) %>% dplyr::filter(flag==max(flag)) %>% ungroup()
tmp_2
#> # A tibble: 5 x 3
#>   town_id city_id  flag
#>     <int>   <int> <int>
#> 1   10500     111     1
#> 2   15300    1110     1
#> 3    9400   11100     1
#> 4    9600   11102     0
#> 5    9800   11103     0

Edit: In case there are more than one per group with the maximum value of flag and you only want to retain the first one, you could do:

tmp_2 <- tmp %>% group_by(town_id) %>%  
    dplyr::filter(flag==max(flag)) %>% 
    filter(row_number()==1) %>% ungroup()

Created on 2020-05-22 by the reprex package (v0.3.0)

user12728748
  • 8,106
  • 2
  • 9
  • 14