Duplicating observations of a dataframe, but also replacing specific variable values in R

Question

I am looking for some advice on some data restructuring. I am collecting some data using Google Forms which I download as a csv file and looks something like the following:

# alpha                 beta    option
#  6             8, 9, 10, 11    apple
#  9                        6     pear
#  1                        6    apple
#  3                     8, 9     pear
#  3                     6, 8     lime
#  3                        1    apple
#  2, 4, 7, 11              9     lime

The data has two variables (alpha and beta) that each list numbers. For the majority of my data there is only one number in each variable. However, for some observations there can be two, three or even up to ten numbers. This is because these are responses gathered using the 'checkbox' option in google forms which allows multiple answers to one survey question. Also, it may be of importance to some potential solutions that google forms returns leading whitespace before each of the multiple answers.

In my real data this occurs in only a very small proportion of all observations, the above is a more condensed example. There are several other variables in the dataset. Here I am just including one called 'option' that contains factors.

What I need to do is to duplicate all observations that contain multiple numbers in either the 'alpha' or the 'beta' variable. The number of duplicated rows should be equal to the number of numbers that exist in the alpha or beta variable. Then, I need to replace the sequence of numbers in the 'alpha' or 'beta' variables with each number independently. That would result in something like the following:

#  alpha  beta   option
#     6    8     apple
#     6    9     apple
#     6   10     apple
#     6   11     apple
#     9    6      pear
#     1    6     apple
#     3    8      pear
#     3    9      pear
#     3    6      lime
#     3    8      lime
#     3    1     apple
#     2    9      lime
#     4    9      lime
#     7    9      lime
#    11    9      lime

Here is the data that reproduces the original example data above. I have called the dataframe 'demo':

demo<-structure(list(alpha = structure(c(4L, 5L, 1L, 3L, 3L, 3L, 2L), .Label =
 c("1","2, 4, 7, 11", "3", "6", "9"), class = "factor"), beta = structure(c(5L, 2L, 2L, 
4L, 3L, 1L, 6L), .Label = c("1", "6", "6, 8", "8, 9", "8, 9, 10, 11", "9"), class =   
"factor"), option = structure(c(1L, 3L, 1L, 3L, 2L, 1L, 2L), .Label = c("apple", 
"lime", "pear"), class = "factor")), .Names = c("alpha", "beta", "option"), class =   
"data.frame", row.names = c(NA, -7L))

OK. So I think I have written some code that in a very long-winded fashion does lead to the new dataframe I am looking for. However, it feels like there must be a more elegant and better way of doing it.

Basically, I work on the 'alpha' variable first. I first subset the observations based on whether commas exist in the variable or not. With the observations that contain commas, I then use strsplit to separate the numbers. I then count how many numbers exist for each observation and duplicate each observation by that. I then melt the split numbers into a dataframe with all the numbers in a variable named 'value'. I simply then replace the 'alpha' variable with the data in the melted 'value' variable. I then rbind this back with the data that did not contain commas. I then use this df and work on the 'beta' variable....

Here is my solution (it seems to work?):

library(reshape2)

demo$a<-grepl(",", demo$alpha)
demo.atrue <- demo[ which(demo$a=='TRUE'), ]
demo.afalse <- demo[ which(demo$a=='FALSE'), ]
demo.atrue$alpha<-as.character(demo.atrue$alpha)
temp<-strsplit(demo.atrue$alpha, ",")
temp.lengths<-lapply(temp, length)

for (i in 1:length(temp)) { 
df.expanded <- demo.atrue[rep(row.names(demo.atrue), temp.lengths), 1:3]
}

temp.melt<-melt(temp)
df.expanded$alpha<-temp.melt$value
demo.afalse<-demo.afalse[c(1:3)]
demonew<-rbind(demo.afalse, df.expanded)



demonew$b<-grepl(",", demonew$beta)
demonew.btrue <- demonew[ which(demonew$b=='TRUE'), ]
demonew.bfalse <- demonew[ which(demonew$b=='FALSE'), ]
demonew.btrue$beta<-as.character(demonew.btrue$beta)

temp<-strsplit(demonew.btrue$beta, ",")
temp.lengths<-lapply(temp, length)

for (i in 1:length(temp)) { 
  df.expanded1 <- demonew.btrue[rep(row.names(demonew.btrue), temp.lengths), 1:3]
}

temp.melt<-melt(temp)
df.expanded1$beta<-temp.melt$value
demonew.bfalse<-demonew.bfalse[c(1:3)]
demonew1<-rbind(df.expanded1, demonew.bfalse)

demonew1  #this seems to work, but doesn't feel very efficient

As well as perhaps not being very efficient, I am not sure whether this will work in all conditions. In particular if multiple numbers exist in both the 'alpha' and 'beta' variables for the same observation. I have tested it with a few examples and it seems ok, but I am not confident with it.

Thank you for any consideration.

This is like a multiple variable version of: http://stackoverflow.com/questions/13773770/split-comma-separated-column-into-separate-rows — thelatemail, Jun 27 '14 at 03:05

A5C1D2H2I1M1N2O1R2T1 · Accepted Answer · 2014-06-27T06:01:13.600

You can use my cSplit function, nested twice, like this:

cSplit(cSplit(demo, "alpha", ",", "long"), "beta", ",", "long")
#     alpha beta option
#  1:     6    8  apple
#  2:     6    9  apple
#  3:     6   10  apple
#  4:     6   11  apple
#  5:     9    6   pear
#  6:     1    6  apple
#  7:     3    8   pear
#  8:     3    9   pear
#  9:     3    6   lime
# 10:     3    8   lime
# 11:     3    1  apple
# 12:     2    9   lime
# 13:     4    9   lime
# 14:     7    9   lime
# 15:    11    9   lime

Some benchmarks:

More interesting sample data. 700 rows instead of 7 (still quite a small dataset)...

demo <- do.call(rbind, replicate(100, demo, FALSE))
library(data.table)
demo2 <- data.table(demo)

Functions to test...

## MrFlick's
fun1 <- function() {
  do.call(rbind, with(demo, Map(expand.grid,
                                alpha = strsplit(alpha,", "),
                                beta = strsplit(beta, ", "),
                                option = option
  )))
} 

## Mine
fun2 <-  function() {
  cSplit(cSplit(demo2, "alpha", ",", "long"), "beta", ",", "long")
} 

## thelatemail's one-liner
fun3 <- function() {
  do.call(rbind,do.call(Map, c(expand.grid, lapply(demo, strsplit, ", "))))
}

The actual benchmarking...

library(microbenchmark)
microbenchmark(MF = fun1(), AM = fun2(), TH = fun3(), times = 10)
# Unit: milliseconds
#  expr       min        lq    median        uq       max neval
#    MF 785.34875 789.94924 800.11046 800.93643 813.62390    10
#    AM  11.54569  11.93483  12.14181  12.31329  12.93208    10
#    TH 790.46069 799.68518 803.47294 827.69520 899.11219    10

score 5 · Answer 2 · answered Jun 27 '14 at 01:43

5

Actually this shouldn't be too bad. First, for simplicity, i'm going to convert all the columns to characters to make the later splits easier

demo[] <- lapply(demo, as.character)

Now let's do the hard work. Basically i'll split the "alpha" and "beta" columns on the ", " separator. Then i'll use expand.grid to combine all the elements of "alpha","beta", and "option." This will take care of repeating the necessary rows and will work if both "alpha" and "beta" have multiple values. Finally, i'll re-combine all the newly generated rows into one great big data.frame. Here's the code

do.call(rbind, with(demo, Map(expand.grid, 
     alpha = strsplit(alpha,", "), 
     beta = strsplit(beta, ", "), 
     option = option
)))

And that's it. It will return

   alpha beta option
1      6    8  apple
2      6    9  apple
3      6   10  apple
4      6   11  apple
5      9    6   pear
6      1    6  apple
7      3    8   pear
8      3    9   pear
9      3    6   lime
10     3    8   lime
11     3    1  apple
12     2    9   lime
13     4    9   lime
14     7    9   lime
15    11    9   lime

answered Jun 27 '14 at 01:43

MrFlick

195,160
17
277
295

thanks - that's great. I need to do more reading on expand.grid and Map to fully understand what's going on. But this works! – jalapic Jun 27 '14 at 02:00
one quick point of clarification. When applying this solution to my real data that contains multiple other variables (including Timestamps, various strings, other variables with factors), I am getting error messages along the lines of "variableX not found". Within the with Map(expand.grid command, do the variables of the dataframe to be used need to be in any particular order? – jalapic Jun 27 '14 at 02:14
No, the variables do not need to be in any order. Just make sure to change the data.frame name in the `with()` statement from `demo` to whatever your actual name is. You may also only want to make sure the values you are splitting are characters rather than converting everything to character like I did. If you do, the data.types should be preserved in this transformation. – MrFlick Jun 27 '14 at 02:17
thanks - I tried both. It was as ridiculous as forgetting to put a comma at the end of each variable name. – jalapic Jun 27 '14 at 02:52
2

`do.call(rbind,do.call(Map, c(expand.grid, lapply(demo, strsplit, ", "))))` as a one liner. – thelatemail Jun 27 '14 at 03:44

Duplicating observations of a dataframe, but also replacing specific variable values in R

2 Answers2

Linked