3

Assume I have the following database df.

df <- data.frame(ID= c("A", "B", "C"),
             Var1 = c(234, 12, 345),
             Var2 = c(4, 555, 325),
             Var3 = c("45|221|2", "982", NA))

> df
  ID Var1 Var2     Var3
1  A  234    4 45|221|2
2  B   12  555      982
3  C  345  325     <NA>

I would like to create a data.frame in which Var1 and Var2 is combined with the elements in Var3 by ID.

The outcome I am looking for should look like the following:

> outcome
  ID VarA VarB
1  A  234   45
2  A  234  221
3  A  234    2
4  A    4   45
5  A    4  221
6  A    4    2
7  B   12  982
8  B  555  982

Note that:

  • the elements in Var3 are separated by a vertical bar |
  • ID == C is not in outcome because Var3 is NA for that ID.

The original data consists of millions of IDs.

wake_wake
  • 1,332
  • 2
  • 19
  • 46
  • This seems to be a dupe of this: https://stackoverflow.com/questions/13773770/split-comma-separated-strings-in-a-column-into-separate-rows in combination with this https://stackoverflow.com/questions/2185252/reshaping-data-frame-from-wide-to-long-format – markus Jan 22 '19 at 21:53
  • @markus Indeed, there are a bunch of entries on moving from wide to long, and how to split a character string. The tricky part here, I guess, is to properly stack Var1-Var3 and Var2-Var3 by ID, and to do this efficiently. – wake_wake Jan 22 '19 at 21:58
  • What about `library(tidyverse); df %>% filter(ID != "C") %>% separate_rows(Var3, sep = "\\|") %>% gather(key, Var2, -ID, -Var3) %>% select(-key) %>% arrange(ID)` ? – markus Jan 22 '19 at 22:01
  • @markus I didn't see you wrote a similar answer to mine until posting refreshed the page. If you want to post yours, I'll remove mine, as I think `separate_rows` is a more logical function to use for this problem, even though `str_split` and `unnest` have the same result. – Mako212 Jan 22 '19 at 22:27
  • 2
    @Mako212 Don't worry. Edit your answer and include `separate_rows` if you like. – markus Jan 22 '19 at 22:29

2 Answers2

2

We can use tidyverse for a fairly elegant solution. The general idea is that we can use separate_rows to expand Var3 into rows, we just need to get Var1/Var2 into a suitable long format so we don't unnecessarily duplicate values.

library(tidyverse)
library(stringr)

df %>% gather(variable, value, -ID, -Var3) %>% # pull Var1 and Var2 into 
  # a single pair of key/value columns
  separate_rows(Var3, sep = "\\|") %>% # split Var3 into rows for each value
  drop_na(Var3) %>% # drop the NA rows
  select(ID, VarA = value, VarB = Var3, -variable) %>%
  arrange(ID)

  ID VarA VarB
1  A  234   45
2  A  234  221
3  A  234    2
4  A    4   45
5  A    4  221
6  A    4    2
7  B   12  982
8  B  555  982
Mako212
  • 6,787
  • 1
  • 18
  • 37
1

With tidyverse and splitstackshape you can do:

df %>%
 filter(!is.na(Var3)) %>%
 select(-Var3) %>%
 gather(var, VarA, -ID) %>%
 select(-var) %>%
 full_join(df %>%
            filter(!is.na(Var3)) %>%
            cSplit("Var3", sep = "|") %>%
            select(-Var1, -Var2) %>%
            gather(var, VarB, -ID, na.rm = TRUE) %>%
            select(-var), by = c("ID" = "ID")) %>%
 arrange(ID, VarA, VarB)

  ID VarA VarB
1  A    4    2
2  A    4   45
3  A    4  221
4  A  234    2
5  A  234   45
6  A  234  221
7  B   12  982
8  B  555  982

First, it filters out the rows where there is a NA on "Var3". Second it transforms the data from wide to long format, without the variable "Var3". Finally, it performs a full join with the df where the rows with NA on "Var3" were filtered out and "Var3" was split based on "|" and then transformed to wide to long format, without "Var1" and "Var2".

tmfmnk
  • 38,881
  • 4
  • 47
  • 67