I am looking for some advice on some data restructuring. I am collecting some data using Google Forms which I download as a csv file and looks something like the following:
# alpha beta option
# 6 8, 9, 10, 11 apple
# 9 6 pear
# 1 6 apple
# 3 8, 9 pear
# 3 6, 8 lime
# 3 1 apple
# 2, 4, 7, 11 9 lime
The data has two variables (alpha and beta) that each list numbers. For the majority of my data there is only one number in each variable. However, for some observations there can be two, three or even up to ten numbers. This is because these are responses gathered using the 'checkbox' option in google forms which allows multiple answers to one survey question. Also, it may be of importance to some potential solutions that google forms returns leading whitespace before each of the multiple answers.
In my real data this occurs in only a very small proportion of all observations, the above is a more condensed example. There are several other variables in the dataset. Here I am just including one called 'option' that contains factors.
What I need to do is to duplicate all observations that contain multiple numbers in either the 'alpha' or the 'beta' variable. The number of duplicated rows should be equal to the number of numbers that exist in the alpha or beta variable. Then, I need to replace the sequence of numbers in the 'alpha' or 'beta' variables with each number independently. That would result in something like the following:
# alpha beta option
# 6 8 apple
# 6 9 apple
# 6 10 apple
# 6 11 apple
# 9 6 pear
# 1 6 apple
# 3 8 pear
# 3 9 pear
# 3 6 lime
# 3 8 lime
# 3 1 apple
# 2 9 lime
# 4 9 lime
# 7 9 lime
# 11 9 lime
Here is the data that reproduces the original example data above. I have called the dataframe 'demo':
demo<-structure(list(alpha = structure(c(4L, 5L, 1L, 3L, 3L, 3L, 2L), .Label =
c("1","2, 4, 7, 11", "3", "6", "9"), class = "factor"), beta = structure(c(5L, 2L, 2L,
4L, 3L, 1L, 6L), .Label = c("1", "6", "6, 8", "8, 9", "8, 9, 10, 11", "9"), class =
"factor"), option = structure(c(1L, 3L, 1L, 3L, 2L, 1L, 2L), .Label = c("apple",
"lime", "pear"), class = "factor")), .Names = c("alpha", "beta", "option"), class =
"data.frame", row.names = c(NA, -7L))
OK. So I think I have written some code that in a very long-winded fashion does lead to the new dataframe I am looking for. However, it feels like there must be a more elegant and better way of doing it.
Basically, I work on the 'alpha' variable first. I first subset the observations based on whether commas exist in the variable or not. With the observations that contain commas, I then use strsplit to separate the numbers. I then count how many numbers exist for each observation and duplicate each observation by that. I then melt the split numbers into a dataframe with all the numbers in a variable named 'value'. I simply then replace the 'alpha' variable with the data in the melted 'value' variable. I then rbind this back with the data that did not contain commas. I then use this df and work on the 'beta' variable....
Here is my solution (it seems to work?):
library(reshape2)
demo$a<-grepl(",", demo$alpha)
demo.atrue <- demo[ which(demo$a=='TRUE'), ]
demo.afalse <- demo[ which(demo$a=='FALSE'), ]
demo.atrue$alpha<-as.character(demo.atrue$alpha)
temp<-strsplit(demo.atrue$alpha, ",")
temp.lengths<-lapply(temp, length)
for (i in 1:length(temp)) {
df.expanded <- demo.atrue[rep(row.names(demo.atrue), temp.lengths), 1:3]
}
temp.melt<-melt(temp)
df.expanded$alpha<-temp.melt$value
demo.afalse<-demo.afalse[c(1:3)]
demonew<-rbind(demo.afalse, df.expanded)
demonew$b<-grepl(",", demonew$beta)
demonew.btrue <- demonew[ which(demonew$b=='TRUE'), ]
demonew.bfalse <- demonew[ which(demonew$b=='FALSE'), ]
demonew.btrue$beta<-as.character(demonew.btrue$beta)
temp<-strsplit(demonew.btrue$beta, ",")
temp.lengths<-lapply(temp, length)
for (i in 1:length(temp)) {
df.expanded1 <- demonew.btrue[rep(row.names(demonew.btrue), temp.lengths), 1:3]
}
temp.melt<-melt(temp)
df.expanded1$beta<-temp.melt$value
demonew.bfalse<-demonew.bfalse[c(1:3)]
demonew1<-rbind(df.expanded1, demonew.bfalse)
demonew1 #this seems to work, but doesn't feel very efficient
As well as perhaps not being very efficient, I am not sure whether this will work in all conditions. In particular if multiple numbers exist in both the 'alpha' and 'beta' variables for the same observation. I have tested it with a few examples and it seems ok, but I am not confident with it.
Thank you for any consideration.