4

This is a follow up to this question: Duplicating observations of a dataframe, but also replacing specific variable values in R

I have tried to write as succinctly as possible, whilst giving all necessary information. In this current example, I have a df that looks like this:

df<-data.frame(alpha=c(1, "3, 4", "2, 4, 5", 2, 1, 3, "1, 2", "1, 2, 3"), 
           beta=c("2, 4", "3, 4", 1, 3, 3, "1, 4", "1, 2", "1, 2, 3"),
           color=c("red", "yellow"))


#    alpha    beta  color
#1       1    2, 4    red
#2    3, 4    3, 4 yellow
#3 2, 4, 5       1    red
#4       2       3 yellow
#5       1       3    red
#6       3    1, 4 yellow
#7    1, 2    1, 2    red
#8 1, 2, 3 1, 2, 3 yellow

Desired output

I want it to end up looking something like this (the order of the rows isn't that important to me):

#  alpha beta  color value
#1     1    2    red     1
#2     1    4    red     1
#3     3    4 yellow   0.5
#4     2    1    red     1
#5     4    1    red     1
#6     5    1    red     1
#7     2    3 yellow     1
#8     1    3    red     1
#9     3    1 yellow     1
#10    3    4 yellow     1
#11    1    2    red   0.5
#12    1    2 yellow   0.5
#13    1    3 yellow   0.5
#14    2    3 yellow   0.5

Whenever there is an observation of alpha or beta with multiple non-overlapping/non-matching entries, then that observation is expanded. e.g. Line 1 of the original df has alpha="1" and beta="2, 4". These become two separate rows in the output with the "2" and "4" of beta separated. The information kept in the variable 'color' is kept for each observation. Additionally, a new variable 'value' gets a 1.

The key difference between this and the previous question is the presence of rows where alpha==beta .

Here, I only want to keep unique combinations of alpha and beta, and not duplicates. e.g. row 8 of the original df has alpha="1, 2, 3" and beta="1, 2, 3". I want to have separate rows where "alpha=1, beta=2", "alpha=1,beta=3", "alpha=2,beta=3". Again, the 'color' variable will be duplicated. However, here, value needs to be '0.5'.

What I have tried:

I couldn't figure out how to do this all in one, so I first subset the df according to whether alpha is equal to beta or not.

x <- df[df$alpha!=df$beta,]

df$alpha<-as.character(df$alpha)
df$beta<-as.character(df$beta)

fun1 <- function(df){
df$alpha<-as.character(df$alpha)
df$beta<-as.character(df$beta)

do.call(rbind, with(df, Map(expand.grid, 
                            alpha = strsplit(alpha, ", "),
                            beta = strsplit(beta, ", "),
                            color = color,
                            value = 1
)))
}

fun1(x)

which gives:

  alpha beta  color value
#1     1    2    red     1
#2     1    4    red     1
#3     2    1    red     1
#4     4    1    red     1
#5     5    1    red     1
#6     2    3 yellow     1
#7     1    3    red     1
#8     3    1 yellow     1
#9     3    4 yellow     1

Then the observations where alpha==beta , which is where I'm getting stuck ...

x2 <- df[df$alpha==df$beta,]
x2

#    alpha    beta  color
#2    3, 4    3, 4 yellow
#7    1, 2    1, 2    red
#8 1, 2, 3 1, 2, 3 yellow

My idea was to split the data using strsplit and then to use combn to find the combinations I'm looking for and to bind back together. However, this is not going to work like this if I want to duplicate other variables...

a<-strsplit(x2$alpha, ", ")
a.combs <- lapply(a, function(x) c(combn(x, 2, simplify=FALSE)))  
matrix(unlist(a.combs),ncol=2, byrow=T)

gives:

#    [,1] [,2]
#[1,] "3"  "4" 
#[2,] "1"  "2" 
#[3,] "1"  "2" 
#[4,] "1"  "3" 
#[5,] "2"  "3" 

Any ideas on how to do get these combinations along with the pre-existing 'color' variable and adding in the new 'value' variable are greatly appreciated.

Community
  • 1
  • 1
jalapic
  • 13,792
  • 8
  • 57
  • 87

1 Answers1

5

I've just done this quickly, so I'm not sure if it captures all of your conditions, but this seems to work.

It starts from pretty much the same point, using my cSplit function, nested twice, like this (but this time, adding an ID):

library(devtools)
source_gist(11380733)
temp <- cSplit(cSplit(cbind(id = 1:nrow(df), df),
                      "alpha", ",", "long"), 
               "beta", ",", "long")

Here's the new stuff:

SD <- c("alpha", "beta")
## Convert "alpha" and "beta" to numeric
temp[, (SD) := lapply(.SD, as.numeric), .SDcols = SD]

## Sort your alphas and betas, and check for duplicates
##   and any points where alpha equals beta
temp[, toDrop := duplicated(
  paste(pmin(alpha, beta), pmax(alpha, beta))) |
    alpha == beta, by = id]

## Create your "value" column
temp[, value := ifelse(any(toDrop), 0.5, 1), by = id]

## Subset and drop the irrelevant columns
out <- temp[!temp[, toDrop, with = TRUE]][, toDrop := NULL]

This is the output:

out
#     id alpha beta  color value
#  1:  1     1    2    red   1.0
#  2:  1     1    4    red   1.0
#  3:  2     3    4 yellow   0.5
#  4:  3     2    1    red   1.0
#  5:  3     4    1    red   1.0
#  6:  3     5    1    red   1.0
#  7:  4     2    3 yellow   1.0
#  8:  5     1    3    red   1.0
#  9:  6     3    1 yellow   1.0
# 10:  6     3    4 yellow   1.0
# 11:  7     1    2    red   0.5
# 12:  8     1    2 yellow   0.5
# 13:  8     1    3 yellow   0.5
# 14:  8     2    3 yellow   0.5
A5C1D2H2I1M1N2O1R2T1
  • 190,393
  • 28
  • 405
  • 485
  • Wow, this is great - thanks. It is motivating me to learn a lot more about data.table. The csplit function is fantastic. I have tested it with various data samples and it seems to work well. Thanks again. – jalapic Sep 18 '14 at 13:55
  • hope you see this comment. This function still works fine when I just run in an R script. However, when I put it into an R package it doesn't work - the following errors occur... `Error in `:=`((SD), lapply(.SD, as.numeric)) : Check that is.data.table(DT) == TRUE. Otherwise, := and `:=`(...) are defined for use in j, once only and in particular ways. See help(":=").` i believe it's to do with updates to `data.table` ? but I can't see why it wouldn't work in my package. I wonder if you have any ideas? It's hard to make a repro example without all the package info. – jalapic Mar 07 '15 at 01:31
  • @jalapic, do you have the repo somewhere like GitHub and if so, is it possible for me to see it? Are there any lines in the code that re-convert the `data.table` to a `data.frame` before getting to the `as.numeric` step? – A5C1D2H2I1M1N2O1R2T1 Mar 07 '15 at 02:09
  • I will set that up - might not be until tomorrow now, but thanks for the comment. Will let you know when I've done it. Thanks. – jalapic Mar 07 '15 at 02:27
  • I have added a gist here: https://gist.github.com/jalapic/4e0b26bd74d3b3751d7e and the R package is on github here: https://github.com/jalapic/curleylab/tree/master - As you can see, the function works fine if loaded into the global environment, but not if loaded as part of the R package. Yet, the function is identical? Any help appreciated - thanks ! – jalapic Mar 08 '15 at 18:43
  • @jalapic, from what I can see, the main issue is that "splitstackshape" and "data.table" are not in the search path because they are not in the namespace. You need to either use "import" or "import from". I've created two pull requests with the changes I made that seem to make the function work in the package too. – A5C1D2H2I1M1N2O1R2T1 Mar 09 '15 at 02:32
  • Thanks - got it. They do work - I did leave an extra question on the pull request. I'm a bit confused why listing the packages under 'imports' on the DESCRIPTION file is not sufficient? – jalapic Mar 09 '15 at 02:46