7

Consider the following dataframe:

dummy_df <- tibble(
  A=c("ABC", "ABC", "BCD", "CDF", "CDF", "CDF"),
  B=c(0.25, 0.25, 1.23, 0.58, 0.58, 0.58),
  C=c("lorem", "ipsum", "dolor", "amet", "something", "else"),
  D=c("up", "up", "down", "down", "down", "down"),
  E=c(132, 132, 243, 512, 512, 512),
  F=c("m1", "m1", "m5", "m3", "m3", "m3"),
  G=c("val", "val", "mur", "mad", "mad", "mad"),
  H=c("grx", "grx", "bcn", "sal", "sal", "sal"),
  I=c(1.68, 1.68, 2.31, 3.12, 3.12, 3.12),
  J=c("p", "p", "f", "p", "p", "p"),
  K=c(100, 100, 200, 143, 143, 143)
)

# A tibble: 6 × 11
  A         B C         D         E F     G     H         I J         K
  <chr> <dbl> <chr>     <chr> <dbl> <chr> <chr> <chr> <dbl> <chr> <dbl>
1 ABC    0.25 lorem     up      132 m1    val   grx    1.68 p       100
2 ABC    0.25 ipsum     up      132 m1    val   grx    1.68 p       100
3 BCD    1.23 dolor     down    243 m5    mur   bcn    2.31 f       200
4 CDF    0.58 amet      down    512 m3    mad   sal    3.12 p       143
5 CDF    0.58 something down    512 m3    mad   sal    3.12 p       143
6 CDF    0.58 else      down    512 m3    mad   sal    3.12 p       143

After reading this, I have managed to collapse column C so that its value is concatenated into a string for every unique row value of column A.

dummy_df %>% group_by(A) %>% summarise(hits = toString(C), nhits=n())

# A tibble: 3 × 3
  A     hits                  nhits
  <chr> <chr>                 <int>
1 ABC   lorem, ipsum              2
2 BCD   dolor                     1
3 CDF   amet, something, else     3

However, I'm loosing every other column information, which are essential to me. How can I retain information about all the columns while collapsing column C? Ideally it should be done without having to hard-code the columns names, since the number of columns can vary depending on the dataset.

I have read this, but the example shown doesn't create new variables so I haven't been able to make it work.

This is what I'm looking for:

# A tibble: 3 × 12
  A     hits                  nhits     B D         E F     G     H         I J         K
  <chr> <chr>                 <int> <dbl> <chr> <dbl> <chr> <chr> <chr> <dbl> <chr> <dbl>
1 ABC   lorem, ipsum              2  0.25 up      132 m1    val   grx    1.68 p       100
2 BCD   dolor                     1  1.23 down    243 m5    mur   bcn    2.31 f       200
3 CDF   amet, something, else     3  0.58 down    512 m3    mad   sal    3.12 p       143
jpm92
  • 143
  • 1
  • 8

3 Answers3

5

If we assume that all fields other than C will contain repeating values for each A-group, then we can do this:

dummy_df %>%
  group_by(A) %>%
  summarize(
    nhits = n(),
    across(where(is.character), ~ toString(unique(.))),
    across(where(~ !is.character(.)), ~ unique(.))
  ) %>%
  ungroup()
# # A tibble: 3 × 12
#   A     nhits C                     D     F     G     H     J         B     E     I     K
#   <chr> <int> <chr>                 <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
# 1 ABC       2 lorem, ipsum          up    m1    val   grx   p      0.25   132  1.68   100
# 2 BCD       1 dolor                 down  m5    mur   bcn   f      1.23   243  2.31   200
# 3 CDF       3 amet, something, else down  m3    mad   sal   p      0.58   512  3.12   143

In this case, since all other variables are repeating, everything stays as-is. If, however, we have variability, then I've set up a catch, where strings are combined (with toString) and differing numbers cause repeat rows.

dummy_df$H[1] <- "GRX"
dummy_df$K[1] <- 99
 %>%
  group_by(A) %>%
  summarize(
    nhits = n(),
    across(where(is.character), ~ toString(unique(.))),
    across(where(~ !is.character(.)), ~ unique(.))
  ) %>%
  ungroup()
# A tibble: 4 × 12
  A     nhits C                     D     F     G     H        J         B     E     I     K
  <chr> <int> <chr>                 <chr> <chr> <chr> <chr>    <chr> <dbl> <dbl> <dbl> <dbl>
1 ABC       2 lorem, ipsum          up    m1    val   GRX, grx p      0.25   132  1.68    99
2 ABC       2 lorem, ipsum          up    m1    val   GRX, grx p      0.25   132  1.68   100
3 BCD       1 dolor                 down  m5    mur   bcn      f      1.23   243  2.31   200
4 CDF       3 amet, something, else down  m3    mad   sal      p      0.58   512  3.12   143

The across function iterates over multiple columns. The where function allows us to subset columns by something about their values, where for character columns we will apply toString, and for others we'll go with unique. This means that string columns should never add rows, but non-strings may.

An alternative for the number-like columns would be to use first(.) instead of unique(.), which will silently discard all but the first value in each column by group. Using the same modified data, we would see this, where the key difference is that K has discarded the 100 values since the first value is 99.

dummy_df %>%
  group_by(A) %>%
  summarize(
    nhits = n(), 
    across(where(is.character), ~ toString(unique(.))),
    across(where(~ !is.character(.)), ~ first(.))
  ) %>%
  ungroup()
# # A tibble: 3 × 12
#   A     nhits C                     D     F     G     H        J         B     E     I     K
#   <chr> <int> <chr>                 <chr> <chr> <chr> <chr>    <chr> <dbl> <dbl> <dbl> <dbl>
# 1 ABC       2 lorem, ipsum          up    m1    val   GRX, grx p      0.25   132  1.68    99
# 2 BCD       1 dolor                 down  m5    mur   bcn      f      1.23   243  2.31   200
# 3 CDF       3 amet, something, else down  m3    mad   sal      p      0.58   512  3.12   143

You can choose to use first with string columns as well, if you choose, which could simplify the logic to a single across:

dummy_df %>%
  group_by(A) %>%
  summarize(
    nhits = n(),
    across(everything(), ~ first(.))
  ) %>%
  ungroup()
# # A tibble: 3 × 12
#   A     nhits     B C     D         E F     G     H         I J         K
#   <chr> <int> <dbl> <chr> <chr> <dbl> <chr> <chr> <chr> <dbl> <chr> <dbl>
# 1 ABC       2  0.25 lorem up      132 m1    val   GRX    1.68 p        99
# 2 BCD       1  1.23 dolor down    243 m5    mur   bcn    2.31 f       200
# 3 CDF       3  0.58 amet  down    512 m3    mad   sal    3.12 p       143

Note that we've lost H's "grx" (since "GRX" was first) as well as K's value of 100.

The data.table variant of this last code-block can be simply:

library(data.table)
as.data.table(dummy_df)[, c(.(nhits = .N), lapply(.SD, first)), by = A]

and a base R rendition? Perhaps not as "pretty" :-)

Reduce(
  function(a, b) merge(a, b, by = "A", all = TRUE), 
  list(
    setNames(aggregate(C ~ A, dummy_df, FUN = length), c("A", "nhits")), 
    aggregate(C ~ A, dummy_df, FUN = toString),
    aggregate(. ~ A, subset(dummy_df, select = -C), FUN = function(z) z[1])
  )
)
#     A nhits                     C    B    D   E  F   G   H    I J   K
# 1 ABC     2          lorem, ipsum 0.25   up 132 m1 val GRX 1.68 p  99
# 2 BCD     1                 dolor 1.23 down 243 m5 mur bcn 2.31 f 200
# 3 CDF     3 amet, something, else 0.58 down 512 m3 mad sal 3.12 p 143
r2evans
  • 141,215
  • 6
  • 77
  • 149
  • The question has been updated with a better reprex, I didn't make myself clear, sorry for that. – jpm92 May 05 '23 at 08:26
4

We could do it this way:

library(dplyr) #>= 1.1.0
dummy_df %>%
  summarise(hits = toString(C), nhits = n(), B = unique(B), .by = A)

OR

library(dplyr)

dummy_df %>%
  group_by(A) %>%
  summarise(hits = toString(C), nhits = n(), B = unique(B))
 A     hits                  nhits     B
  <chr> <chr>                 <int> <dbl>
1 ABC   lorem, ipsum              2  0.25
2 BCD   dolor                     1  1.23
3 CDF   amet, something, else     3  0.58
> 
TarJae
  • 72,363
  • 6
  • 19
  • 66
  • Sorry, I mentioned it but maybe I wasn't clear. I need to retain every column in the original df, not only B. Is it still possible to accomplish this with this approach? (without having to write down manually 20 columns, of course :D) – jpm92 May 04 '23 at 18:37
  • @jpm92 Just `group_by(A, B)` if B is always the same – Azor Ahai -him- May 04 '23 at 18:39
  • I am sorry but my output is exact the same as your desired output? – TarJae May 04 '23 at 18:42
  • I know it is, but as I mentioned, I have 20 more columns apart from B that I need to retain that I didn't include in the reprex to simplify it, how could I retain all of them without having to manually specify them in the code? – jpm92 May 04 '23 at 19:31
  • Could you add an extended example? – TarJae May 05 '23 at 06:35
  • Question has been updated. I have managed to accomplish what I was looking for with reframe() and across() but it's still not the best approach I think. Let me know if you can improve it and thanks a lot, really :) – jpm92 May 05 '23 at 08:04
4

Using data.table

library(data.table)
setDT(dummy_df)[, .(B = first(B), hits = toString(C), nhits = .N), by = A]
akrun
  • 874,273
  • 37
  • 540
  • 662