1

I have a dataset arranged such that the data is stored as a list of multiple observations within each 'cell'. See below:

partID | Var 1 | Var 2
1      | 1,2,3 | 4,5,6
2      | 7,8,9 | 1,2,3

I would like to get the data in a format more like this:

partID | Var 1 | Var 2
1      | 1     | 4
1      | 2     | 5
1      | 3     | 6

I've been trying various combinations of melt, unlist, and data.table but I haven't had much luck applying the various ways to expand the lists while simultaneously preserving multiple columns and their names. Am I reduced to looping through the dataset and binding the columns together?

marc_s
  • 732,580
  • 175
  • 1,330
  • 1,459
Andrew S
  • 85
  • 1
  • 9
  • 1
    Can you post some example data with the structure? Using `str(mydataframe)` would be particularly helpful. – Mike H. Jun 15 '16 at 22:24
  • Possible duplicate of [Split comma-separated column into separate rows](https://stackoverflow.com/questions/13773770/split-comma-separated-column-into-separate-rows) – Uwe Oct 16 '17 at 14:08

4 Answers4

1

If for each row, the cells have the same number of entries and they are strings, then this is what you can do, using data.table.

require(data.table)
DT<-data.table(partID=c(1,2),Var1=c("1,2,3","7,8,9"),Var2=c("4,5,6","1,2,3"))

DT2<-DT[,list(Var1=unlist(strsplit(Var1,",")),Var2=unlist(strsplit(Var2,","))),by=partID]

You use strsplit() to split the strings by the commas. You use unlist() to make the entries into a vector, not a list.

If, on the other hand, each cell is already a list, then all you need to do is unlist().

require(data.table)
DT3<-data.table(partID=c(1,2),Var1=list(c(1,2,3),c(7,8,9)),Var2=list(c(4,5,6),c(1,2,3)))

DT4<-DT3[,list(Var1=unlist(Var1),Var2=unlist(Var2)),by=partID]

Either way, you get this:

   partID Var1 Var2
      1    1    4
      1    2    5
      1    3    6
      2    7    1
      2    8    2
      2    9    3
mgriebe
  • 908
  • 5
  • 8
  • The `unlist()` option works fine for this case, but I was really hoping for something a little more general purpose in case I need to do this for other data sets with differently named variables. Thank you in any case. – Andrew S Jun 16 '16 at 15:31
  • 1
    data.table allows you to `lapply` over columns without specifying names: try `DT4<-DT3[,lapply(.SD,unlist),by=partID]` All of the repeated rows (in this case, just `partID`) need to be included in the `by` statement. You can specify which columns to unlist using the `.SDCols` argument. See `data.table` help. – mgriebe Jun 23 '16 at 15:20
1

We can do this easily with cSplit

library(splitstackshape)
cSplit(DT, c("Var1", "Var2"), ",", "long")
#    partID Var1 Var2
#1:      1    1    4
#2:      1    2    5
#3:      1    3    6
#4:      2    7    1
#5:      2    8    2
#6:      2    9    3

data

DT<-data.frame(partID=c(1,2),Var1=c("1,2,3","7,8,9"),Var2=c("4,5,6","1,2,3"))
akrun
  • 874,273
  • 37
  • 540
  • 662
1

The separate_rows() function in tidyr is the boss for observations with multiple delimited values...

# create data
library(tidyverse)
d <- data_frame(
  partID = c(1, 2),
  Var1 = c("1,2,3", "7,8,9"),
  Var2 = c("4,5,6","1,2,3")
)
d
# # A tibble: 2 x 3
#   partID  Var1  Var2
#    <dbl> <chr> <chr>
# 1      1 1,2,3 4,5,6
# 2      2 7,8,9 1,2,3

# tidy data
separate_rows(d, Var1, Var2, convert = TRUE)
# # A tibble: 6 x 3
#   partID  Var1  Var2
#    <dbl> <int> <int>
# 1      1     1     4
# 2      1     2     5
# 3      1     3     6
# 4      2     7     1
# 5      2     8     2
# 6      2     9     3
guyabel
  • 8,014
  • 6
  • 57
  • 86
0

You can also use dplyr and tidyr which provides the unnest function to expand the columns:

library(dplyr); library(tidyr);
df %>% mutate(Var.1 = strsplit(Var.1, ","), Var.2 = strsplit(Var.2, ",")) %>% unnest()

Source: local data frame [6 x 3]

  partID Var.1 Var.2
   (dbl) (chr) (chr)
1      1     1     4
2      1     2     5
3      1     3     6
4      2     7     1
5      2     8     2
6      2     9     3
Psidom
  • 209,562
  • 33
  • 339
  • 356