
Similar to this question but with an added twist:

Given the following data frame:

txt <- "ID    Col1    Col2    Col3    Col4
        1     6       10      NA      NA
        1     5       10      NA      NA
        1     NA      10      15      20
        2     17      25      NA      NA
        2     13      25      NA      NA
        2     NA      25      21      34
        2     NA      25      35      40"
DF <- read.table(text = txt, header = TRUE)

DF
  ID Col1 Col2 Col3 Col4
1  1    6   10   NA   NA
2  1    5   10   NA   NA
3  1   NA   10   15   20
4  2   17   25   NA   NA
5  2   13   25   NA   NA
6  2   NA   25   21   34
7  2   NA   25   35   40

I wish to collapse the rows by group ID (analogous to Col2 in this example), and when more than 1 combination is present per group, to return all combinations, as so:

  ID Col1 Col2 Col3 Col4
1  1    6   10   15   20
2  1    5   10   15   20
3  2   17   25   21   34
4  2   13   25   21   34
5  2   17   25   35   40
6  2   13   25   35   40

Importantly, down the road I'll need this to work on non-numerical data. Any suggestions? Thanks!

Aaron
  • Row 3 of the original gets collapsed into rows 1-2. Rows 6-7 get collapsed into rows 4-5 with a total of 4 combinations. – Aaron Sep 23 '21 at 23:53

3 Answers


Grouped by 'ID', fill the other columns in both directions, ungroup to remove the grouping attribute, and keep only the distinct rows:

library(dplyr)
library(tidyr)
DF %>% 
    group_by(ID) %>% 
    fill(everything(), .direction = 'updown') %>%
    ungroup %>% 
    distinct(.keep_all = TRUE)

Or it may also be done as:

DF %>% 
   group_by(ID) %>% 
   mutate(across(everything(), ~ replace(., is.na(.), 
           rep(.[!is.na(.)], length.out = sum(is.na(.))))))

Or, based on the comments:

DF %>%
   group_by(ID) %>%
   mutate(across(where(~ any(is.na(.))), ~ {
        i1 <- is.na(.)
        ind <- which(i1)
        i2 <- !i1
        if(i1[1] == 1) rep(.[i2], each = n()/sum(i2)) else 
               rep(.[i2], length.out = n())
     })) %>%
   ungroup %>% 
   distinct(.keep_all = TRUE)

-output

# A tibble: 6 x 5
     ID  Col1  Col2  Col3  Col4
  <int> <int> <int> <int> <int>
1     1     6    10    15    20
2     1     5    10    15    20
3     2    17    25    21    34
4     2    13    25    21    34
5     2    17    25    35    40
6     2    13    25    35    40
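
To illustrate the two rep() branches of the last option on single columns, here is a standalone sketch (not part of the original answer) using the ID 2 values from the example:

# Col1 of ID 2: NAs at the end, so the non-NA values are recycled in order
x <- c(17, 13, NA, NA)
rep(x[!is.na(x)], length.out = length(x))
# [1] 17 13 17 13

# Col3 of ID 2: NAs at the start, so each non-NA value is repeated
y <- c(NA, NA, 21, 35)
rep(y[!is.na(y)], each = length(y) / sum(!is.na(y)))
# [1] 21 21 35 35
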
akrun
  • Thanks! These two options give different results. The desired output would be a de-duplicated version of your second response. The first response is missing a collapsed combination of rows 4&7 in the original post – Aaron Sep 23 '21 at 23:56
  • @Aaron at the time I posted this, I had doubts about whether your expected output was a typo or not – akrun Sep 23 '21 at 23:56
  • understood, the expected is not a typo, thanks! – Aaron Sep 23 '21 at 23:57
  • @Aaron I have a doubt about Col1 and Col3, regarding the arrangement of elements. In Col1, the last 2 elements of ID 2 were missing, so it became 17 13 17 13 (repeated), whereas in Col3 the first 2 elements are missing, but it was changed to 21 21 35 35 in the expected output, i.e. breaking the order of 21 35 – akrun Sep 23 '21 at 23:59
  • @Aaron So, my question is more about the logic behind those differences in filling – akrun Sep 24 '21 at 00:00
  • @Aaron i.e. why is 17 13 in alternating order whereas 21 35 is in repeated order – akrun Sep 24 '21 at 00:10
  • @Aaron I updated the post to get the expected output, but I am not sure whether that is the logic you wanted – akrun Sep 24 '21 at 00:22
  • thanks! for my purposes I don't care about the arrangement of elements so long as all combinations of Col1 and (Col3 & Col4) per group ID exist in the output – Aaron Sep 24 '21 at 01:31
  • @Aaron The updated solution gives the expected output as in your post. It may need some tweaking when the number of NA elements differs – akrun Sep 24 '21 at 01:32

A data.table option using zoo's na.locf to fill the missing values.

library(zoo)
library(data.table)

setDT(DF)
cols <- grep('Col', names(DF), value = TRUE)
DF[, (cols) := lapply(.SD, function(x) fcoalesce(na.locf(x, na.rm = FALSE), 
                      na.locf(x, na.rm = FALSE, fromLast = TRUE))), ID]
unique(DF)

#   ID Col1 Col2 Col3 Col4
#1:  1    6   10   15   20
#2:  1    5   10   15   20
#3:  2   17   25   21   34
#4:  2   13   25   21   34
#5:  2   13   25   35   40
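
To see what the two-direction fill does on a single column, here is an illustrative sketch (not part of the original answer) using the Col3 values of ID 2:

x <- c(NA, NA, 21, 35)
zoo::na.locf(x, na.rm = FALSE)                   # forward fill:  NA NA 21 35
zoo::na.locf(x, na.rm = FALSE, fromLast = TRUE)  # backward fill: 21 21 21 35
data.table::fcoalesce(zoo::na.locf(x, na.rm = FALSE),
                      zoo::na.locf(x, na.rm = FALSE, fromLast = TRUE))
# [1] 21 21 21 35

Note that, as the output above shows, this fill-based approach keeps 5 distinct rows rather than the 6 requested combinations, because the forward/backward fill only pairs each missing value with its nearest non-missing neighbours, so not every combination is generated before unique(DF) de-duplicates the rows.
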
Ronak Shah

In a comment, the OP pointed out:

for my purposes I don't care about the arrangement of elements so long as all combinations of Col1 and (Col3&Col4) per group ID exist in the output

So, if I understand correctly, the question is not about collapsing but about creating all possible combinations of non-NA values of columns Col1, Col2, and combined columns (Col3, Col4) for each ID group.

For this, expand() and nesting() from the tidyr package can be used to create the combinations. na.omit() then removes all rows containing any NA:

library(dplyr)
library(tidyr)
DF %>% 
  group_by(ID) %>% 
  expand(Col1, Col2, nesting(Col3, Col4)) %>% 
  na.omit() %>% 
  ungroup()
     ID  Col1  Col2  Col3  Col4
  <int> <int> <int> <int> <int>
1     1     5    10    15    20
2     1     6    10    15    20
3     2    13    25    21    34
4     2    13    25    35    40
5     2    17    25    21    34
6     2    17    25    35    40

This approach should work well also with non-numerical data.
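
As a quick check of that claim, here is a hypothetical sketch that simply coerces the value columns of the example data to character; the same pipeline (assuming DF and the dplyr/tidyr calls from above) runs unchanged:

# Hypothetical check with non-numeric columns
DF_chr <- DF %>% mutate(across(-ID, as.character))
DF_chr %>% 
  group_by(ID) %>% 
  expand(Col1, Col2, nesting(Col3, Col4)) %>% 
  na.omit() %>% 
  ungroup()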

Edit 1

Giving a second thought, I was wondering about the peculiar structure of the input dataset, namely the positions of the NAs:

DF
  ID Col1 Col2 Col3 Col4
1  1    6   10   NA   NA
2  1    5   10   NA   NA
3  1   NA   10   15   20
4  2   17   25   NA   NA
5  2   13   25   NA   NA
6  2   NA   25   21   34
7  2   NA   25   35   40

To me, it appears as if DF was constructed from three separate subsets, the first one for Col1

  ID Col1
1  1    6
2  1    5
4  2   17
5  2   13

the second one for Col2

  ID Col2
1  1   10
4  2   25

the third one for Col3 and Col4

  ID Col3 Col4
3  1   15   20
6  2   21   34
7  2   35   40

Based on this observation, here is a different approach which creates all possible combinations by a series of merge operations (Cartesian joins) of the subsets:

library(magrittr) # piping used here to improve readability
list("Col1", "Col2", c("Col3", "Col4")) %>% 
  lapply(function(x) DF[c("ID", x)] %>% na.omit %>% unique) %>% 
  Reduce(merge, .)
  ID Col1 Col2 Col3 Col4
1  1    6   10   15   20
2  1    5   10   15   20
3  2   17   25   21   34
4  2   17   25   35   40
5  2   13   25   21   34
6  2   13   25   35   40

Here, lapply() creates a list of subsets of the input dataset, which are then merged successively using Reduce().
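
To make the Reduce() step explicit, here is an equivalent, more verbose sketch (the intermediate names sub1, sub2, and sub34 are hypothetical, introduced only for illustration):

sub1  <- unique(na.omit(DF[c("ID", "Col1")]))
sub2  <- unique(na.omit(DF[c("ID", "Col2")]))
sub34 <- unique(na.omit(DF[c("ID", "Col3", "Col4")]))
merge(merge(sub1, sub2), sub34)  # each merge joins on the common "ID" column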

Edit 2:

With version 4.1.0, R has gained a simple native forward pipe syntax |> and \() as a shorthand notation for function(). With this, the code of Edit 1 can be re-written to use only base R (without magrittr):

list("Col1", "Col2", c("Col3", "Col4")) |> 
  lapply(\(x) DF[c("ID", x)] |> na.omit() |> unique()) |>
  (\(z) Reduce(merge, z))()
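
The final Reduce() call is wrapped in an anonymous function because the native pipe inserts the left-hand side as the first argument of the right-hand call, which for Reduce() would be the function f rather than the list to be reduced.
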
Uwe
  • Thanks! You understand correctly, and I edited the question title to make my goal more clear. – Aaron Sep 24 '21 at 16:09