I'm trying to summarise a data frame by ID, but only for IDs that appear in exactly two rows; IDs that appear in more than two rows should not be summarised. I have tried:

df %>% group_by(ID) %>% dplyr::summarize_if(n() == 2, first)

but I'm getting an error that says: "Error in n(): ! Must be used inside dplyr verbs."

Thanks for helping!

Maya Eldar

1 Answer

dat <- data.frame(id=c(1,1,2,2,2), val=1:5)
library(dplyr)
dat %>%
  group_by(id) %>%
  summarize(val = if (n() == 2L) first(val) else val) %>%
  ungroup()
# # A tibble: 4 × 2
#      id   val
#   <dbl> <int>
# 1     1     1
# 2     2     3
# 3     2     4
# 4     2     5

You shouldn't use summarize_if here for two reasons:

  1. It's been superseded in favor of summarize(across(...)), with or without the use of dplyr::where; and
  2. The _if variant chooses the columns to summarize, not the rows.
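
To illustrate the second point, here is a minimal sketch of what a column predicate does. The `label` column is a hypothetical addition to the sample data, included only so that there is a non-numeric column to exclude.

```r
library(dplyr)

# Sample data from this answer, plus a hypothetical character column
dat <- data.frame(id = c(1, 1, 2, 2, 2), val = 1:5, label = letters[1:5])

# where(is.numeric) decides which COLUMNS are summarised.
# Every group still collapses to a single row, whatever its size.
res <- dat %>%
  group_by(id) %>%
  summarize(across(where(is.numeric), first))

res
```

The predicate never looks at how many rows a group has, so it cannot express "only groups of exactly two rows"; that test has to live inside the summarising expression itself.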

To use the across(...) variant for multiple columns, we can do this:

dat %>%
  group_by(id) %>%
  summarize(across(everything(), ~ if (n() == 2L) first(.) else .)) %>%
  ungroup()
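
One caveat, assuming dplyr 1.1.0 or later: returning more than one row per group from summarize() is deprecated there and emits a warning pointing to reframe(). The sketch below shows the same logic with reframe(), followed by a filter()-based alternative that keeps every column without naming any. The `other` column is a hypothetical addition for illustration.

```r
library(dplyr)

# Sample data from above, plus a hypothetical second value column
dat <- data.frame(id = c(1, 1, 2, 2, 2), val = 1:5, other = 11:15)

# reframe() allows any number of rows per group and returns ungrouped data
res_reframe <- dat %>%
  group_by(id) %>%
  reframe(across(everything(), ~ if (n() == 2L) first(.x) else .x))

# Alternative: keep only the first row of two-row groups and
# every row of groups of any other size
res_filter <- dat %>%
  group_by(id) %>%
  filter(n() != 2L | row_number() == 1L) %>%
  ungroup()

res_reframe
res_filter
```

The filter() form never builds a per-column expression, so it may be the simpler choice when "summarise" really means "keep the first row".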
r2evans
  • Thank you! It was really helpful, but the resulting data frame contains only one column, the ID column. How can I summarise all columns? I also couldn't get the across variant you suggested to work; I'm getting a long error: Error in `dplyr::summarise()`: ! Problem while computing `..1 = across(everything(), ~if (n() == 2L) first(.))`. ℹ The error occurred in group 730: ID = NA. Caused by error in `across()`: Caused by error in `dplyr_internal_error()`: Backtrace: 1. mess_data_duoble %>% group_by(ID) %>% ... 11. dplyr:::dplyr_internal_error("dplyr:::summarise_mixed_null", ``) – Maya Eldar Mar 26 '23 at 23:05
  • If you ran anything close to the fake data I provided, it is not feasible (in any way I can imagine) to produce only a single column; `summarize` will at a minimum keep the one column (and all if using `across(everything(), ...)`). Since it works perfectly with my sample data, perhaps you should provide a sample of your own data? Please include at least one example each of "2" and "> 2" rows per group so that we can clearly see the difference you are expecting. We don't need dozens of columns; a few (other than the grouping column(s)) should suffice. – r2evans Mar 26 '23 at 23:43
  • BTW, if you did just `across(everything(), ~if (n() == 2L) first(.))`, then you didn't try my code. By not having an `else` clause, the `if` statement assigns `NULL` to that column, and that will (obviously) fail. The use of `else` here is not optional; the only time I think it might be optional is if you were working on a list-column (which I am not showing nor suggesting). – r2evans Mar 26 '23 at 23:45
  • But the code with `else` shows a syntax error... :( – Maya Eldar Mar 28 '23 at 10:18
  • It doesn't error with my data. You haven't shared your data. Please see https://stackoverflow.com/q/5963269, [mcve], and https://stackoverflow.com/tags/r/info for discussions on how to make the question _reproducible_ by sharing data. Best of the bunch is the use of `dput`, `data.frame`, or `read.table`, so that we can work with the same type of data you're working with. We don't need a lot, just enough rows and columns to get the point across. – r2evans Mar 28 '23 at 10:20
  • In case anyone else is using this: in my RStudio it shows a syntax error, but the code worked perfectly despite that. Thanks, r2evans! – Maya Eldar Mar 30 '23 at 22:45
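
The point made in the comments about the missing `else` can be checked in base R alone. This is a minimal sketch; the variable names are purely illustrative.

```r
# No else branch: when the condition is FALSE, the if expression yields NULL
without_else <- if (FALSE) "first row"
is.null(without_else)

# With an else branch there is always a value to return
with_else <- if (FALSE) "first row" else "all rows"
with_else
```

Inside `summarize()`, a group whose size is not 2 would therefore produce `NULL` while other groups produce values, which is consistent with the `summarise_mixed_null` error quoted in the first comment.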