2

I was trying to find the answer to this but couldn't. If there is an answer I apologize and will immediately delete my question.

I'm trying to merge several rows into one (this calculation should be done separately on groups, in this case variable id can be used to group), so that no NA values are left.

# initial dataframe
df_start <- data.frame(
  id = c("as", "as", "as", "as", "as", "bs", "bs", "bs", "bs", "bs"), 
  b = c(NA, NA, NA, NA, "A", NA, NA, 6, NA, NA), 
  c = c(2, NA, NA, NA, NA, 7, NA, NA, NA, NA), 
  d = c(NA, 4, NA, NA, NA, NA, 8, NA, NA, NA), 
  e = c(NA, NA, NA, 3, NA, NA, NA, NA, "B", NA), 
  f = c(NA, NA, 5, NA, NA, NA, NA, NA, NA, 10))

# desired output
df_end <- data.frame(id = c("as", "bs"), 
                     b = c("A", 6), 
                     c = c(2, 7), 
                     d = c(4, 8), 
                     e = c(3,"B"), 
                     f = c(5, 10))
adl
  • 1,390
  • 16
  • 36

2 Answers2

5

No need to delete the question, it may be helpful to some users. This summarises each group to the first non NA occurrence for each column.

library(dplyr)

df_start <- data.frame(
  id = c("as", "as", "as", "as", "as", "bs", "bs", "bs", "bs", "bs"), 
  b = c(NA, NA, NA, NA, "A", NA, NA, 6, NA, NA), 
  c = c(2, NA, NA, NA, NA, 7, NA, NA, NA, NA), 
  d = c(NA, 4, NA, NA, NA, NA, 8, NA, NA, NA), 
  e = c(NA, NA, NA, 3, NA, NA, NA, NA, "B", NA), 
  f = c(NA, NA, 5, NA, NA, NA, NA, NA, NA, 10))

df_start %>% 
  group_by(id) %>% 
  summarise_all(list(~first(na.omit(.))))

Output:

# A tibble: 2 x 6
  id    b         c     d e         f
  <fct> <fct> <dbl> <dbl> <fct> <dbl>
1 as    A        2.    4. 3        5.
2 bs    6        7.    8. B       10.

You will, of course, get some data lost if there is more than one occurrence of a value with each group for each column.

Jack Brookes
  • 3,720
  • 2
  • 11
  • 22
  • What is this now that `funs` is deprecated? – Brian Wiley Apr 04 '21 at 08:37
  • @BrianWiley use `list` https://dplyr.tidyverse.org/reference/summarise_all.html – Jack Brookes Apr 04 '21 at 20:09
  • Thanks I got it to work with a lambda. I just don't think mine is finishing due to memory because its a dataframe that is 900k rows and 214 variables with 514k groups. I used dplyr full joins and removed duplicate columns instead. – Brian Wiley Apr 04 '21 at 20:10
1

Hope this helps, Using dplyr

df_start <- sapply(df_start, as.character)
df_start[is.na(df_start)] <- " "
df_start <- as.data.frame(df_start)

library(dplyr)   
df_start %>% 
  group_by(id) %>% 
  summarise_all(funs(trimws(paste(., collapse = '')))) -> df
Chaitu
  • 151
  • 2
  • 9