4

I have a set of strings which contain space-separated elements. I want to build a matrix which will tell me which elements were part of which strings. For example:

""
"A B C"
"D"
"B D"

Should give something like:

  A B C D
1
2 1 1 1
3       1
4   1   1

Now I've got a solution, but it runs slow as molasse, and I've run out of ideas on how to make it faster:

reverseIn <- function(vector, value) {
    return(value %in% vector)
}

buildCategoryMatrix <- function(valueVector) {
    allClasses <- c()
    for(classVec in unique(valueVector)) {
        allClasses <- unique(c(allClasses,
                               strsplit(classVec, " ", fixed=TRUE)[[1]]))
    }

    resMatrix <- matrix(ncol=0, nrow=length(valueVector))
    splitValues <- strsplit(valueVector, " ", fixed=TRUE)

    for(cat in allClasses) {
        if(cat=="") {
            catIsPart <- (valueVector == "")
        } else {
            catIsPart <- sapply(splitValues, reverseIn, cat)
        }
        resMatrix <- cbind(resMatrix, catIsPart)
    }
    colnames(resMatrix) <- allClasses

    return(resMatrix)
}

Profiling the function gives me this:

$by.self
                  self.time self.pct total.time total.pct
"match"               31.20    34.74      31.24     34.79
"FUN"                 30.26    33.70      74.30     82.74
"lapply"              13.56    15.10      87.86     97.84
"%in%"                12.92    14.39      44.10     49.11

So my actual questions would be: - Where are the 33% spent in "FUN" coming from? - Would there be any way to speed up the %in% call?

I tried turning the strings into factors prior to going into the loop so that I'd be matching numbers instead of strings, but that actually makes R crash. I've also tried going for partial matrix assignment (IE, resMatrix[i,x] <- 1) where i is the number of the string and x is the vector of factors. No dice there either, as it seems to keep on running infinitely.

Matthew Plourde
  • 43,932
  • 7
  • 96
  • 113
Eric Fournier
  • 93
  • 1
  • 5
  • 1
    See [this question and related answers](http://stackoverflow.com/questions/16267552/dummy-variables-from-a-string-variable) for an almost identical problem, just with numbers and a different value as a separator. – A5C1D2H2I1M1N2O1R2T1 Oct 25 '13 at 16:49
  • +1 for showing what effort you've taken and explaining what else you've tried though! – A5C1D2H2I1M1N2O1R2T1 Oct 25 '13 at 16:53
  • Thanks for pointing out the similar question. I had done a quick search, but I really had no idea on which phrase/keyword to look for for this specific problem. – Eric Fournier Oct 29 '13 at 13:07

3 Answers3

5

In the development version of my "splitstackshape" package, there's a helper function called charBinaryMat that can be used for something like this:

Here's the function (since the version of the package on CRAN doesn't have it yet):

charBinaryMat <- function(listOfValues, fill = NA) {
  lev <- sort(unique(unlist(listOfValues, use.names = FALSE)))
  m <- matrix(fill, nrow = length(listOfValues), ncol = length(lev))
  colnames(m) <- lev
  for (i in 1:nrow(m)) {
    m[i, listOfValues[[i]]] <- 1
  }
  m
}

The input is expected to be the output of strsplit:

And here it is in use:

str <- c("" , "A B C" , "D" , "B D" )

## Fill is `NA` by default
charBinaryMat(strsplit(str, " ", fixed=TRUE))
#       A  B  C  D
# [1,] NA NA NA NA
# [2,]  1  1  1 NA
# [3,] NA NA NA  1
# [4,] NA  1 NA  1

## Can easily be set to another value
charBinaryMat(strsplit(str, " ", fixed=TRUE), fill = 0)
#      A B C D
# [1,] 0 0 0 0
# [2,] 1 1 1 0
# [3,] 0 0 0 1
# [4,] 0 1 0 1

Benchmarking

Since your question is about a faster approach, let's benchmark.

  1. The functions for benchmarking:

    CBM <- function() {
      charBinaryMat(strsplit(str, " ", fixed=TRUE), fill = 0)
    }
    BCM <- function() {
      buildCategoryMatrix(str)*1L
    }
    Sapply <- function() {
      y <- unique( unlist( strsplit( str , " " ) ) )
      out <- t(sapply(str, function(x) y %in% unlist(strsplit(x , " " )),
                      USE.NAMES = FALSE )) * 1L
      colnames(out) <- y
      out
    }
    
  2. Some sample data:

    set.seed(1)
    A = sample(10, 100000, replace = TRUE)
    str <- sapply(seq_along(A), function(x)
      paste(sample(LETTERS[1:10], A[x]), collapse = " "))
    head(str)
    # [1] "H G C"               "F H J G"             "H D J A I B"        
    # [4] "A C F H J B E G D I" "F C H"               "I C G B J D F A E" 
    
  3. Some sample output:

    ## Automatically sorted
    head(CBM())
    #      A B C D E F G H I J
    # [1,] 0 0 1 0 0 0 1 1 0 0
    # [2,] 0 0 0 0 0 1 1 1 0 1
    # [3,] 1 1 0 1 0 0 0 1 1 1
    # [4,] 1 1 1 1 1 1 1 1 1 1
    # [5,] 0 0 1 0 0 1 0 1 0 0
    # [6,] 1 1 1 1 1 1 1 0 1 1
    
    ## Sorting just for comparison
    head(BCM())[, LETTERS[1:10]]
    #      A B C D E F G H I J
    # [1,] 0 0 1 0 0 0 1 1 0 0
    # [2,] 0 0 0 0 0 1 1 1 0 1
    # [3,] 1 1 0 1 0 0 0 1 1 1
    # [4,] 1 1 1 1 1 1 1 1 1 1
    # [5,] 0 0 1 0 0 1 0 1 0 0
    # [6,] 1 1 1 1 1 1 1 0 1 1
    
    ## Sorting just for comparison
    head(Sapply())[, LETTERS[1:10]]
    #      A B C D E F G H I J
    # [1,] 0 0 1 0 0 0 1 1 0 0
    # [2,] 0 0 0 0 0 1 1 1 0 1
    # [3,] 1 1 0 1 0 0 0 1 1 1
    # [4,] 1 1 1 1 1 1 1 1 1 1
    # [5,] 0 0 1 0 0 1 0 1 0 0
    # [6,] 1 1 1 1 1 1 1 0 1 1
    
  4. Benchmarking:

    library(microbenchmark)
    microbenchmark(CBM(), BCM(), Sapply(), times=20)
    # Unit: milliseconds
    #      expr        min         lq     median         uq        max neval
    #     CBM()   675.0929   718.3454   777.2423   805.3872   858.6609    20
    #     BCM() 11059.6305 11267.9888 11367.3283 11595.1758 11792.5950    20
    #  Sapply()  3536.7755  3687.0308  3759.7388  3813.4233  3968.3192    20
    
A5C1D2H2I1M1N2O1R2T1
  • 190,393
  • 28
  • 405
  • 485
  • Creating a benchmark which one function call takes 11 seconds seems a bit excessive when you have sub-microsecond precision. – hadley Oct 26 '13 at 12:53
  • Thanks! This works and is lighting fast, though I really wish I understood why it works so much better than anything else I tried. – Eric Fournier Oct 29 '13 at 13:08
  • By the way: I found out that your function will fail if one of the original strings has leading/trailing spaces, leading to an empty string in one of the vectors within the list. gsub("(^ +)|( +$)", "", originalVector) fixed that for me. – Eric Fournier Oct 29 '13 at 13:13
  • @EricFournier, I imagine that would be the case for all the answers, right? At any rate, in the "parent" function, [there is a "trim" command already](https://github.com/mrdwab/splitstackshape/blob/devel/R/concat.split.R#L86), but perhaps I will see whether it makes sense in the main function too. I think it would only matter when `sep = " "` but will have to test some more. Thanks for the comment. – A5C1D2H2I1M1N2O1R2T1 Oct 29 '13 at 13:31
4

This is pretty easy to do with vapply:

x <- c("" , "A B C" , "D" , "B D" )
lines <- strsplit(x, " ", fixed = TRUE)

all <- sort(unique(unlist(lines)))

t(vapply(lines, function(x) all %in% x, numeric(length(all))))  

This is a little slower than @Ananda's approach: https://gist.github.com/hadley/7169138

hadley
  • 102,019
  • 32
  • 183
  • 245
  • 3
    I'm debating whether to tell you that your benchmarks are inaccurate because you've hard-coded `x` into your function in your gist. – A5C1D2H2I1M1N2O1R2T1 Oct 26 '13 at 14:50
  • 2
    You have an O(1) algorithm in your timing gist! :-) But nice to see this alternative to `sapply` that is a lot cleaner to boot. – Simon O'Hanlon Oct 26 '13 at 15:56
  • @hadley, I really wanted yours to be faster. `vapply` is one of those functions that I need to learn, so any examples like these are greatly appreciated. – A5C1D2H2I1M1N2O1R2T1 Oct 27 '13 at 04:25
  • 1
    @AnandaMahto basically you should never use `sapply` in a function, and instead use `vapply`. It's a little faster, but more importantly always returns an object of the same type/size. – hadley Oct 27 '13 at 12:35
2

Here's one way of doing this. There is a lot going on in the line where out is assigned. Basically, we loop over each element of your input vector. We split each element into individual characters, then we look to see which of these is present in a vector of all the unique values in your dataset. This returns either TRUE or FALSE. We use * 1L at the end to turn logical values into integer but you could just wrap the whole thing in as.integer instead. sapply returns the results column-wise but you want them row-wise so we use the transpose function t() to achieve this.

The final line converts to a data.frame and applies column names.

#  Data
str <- c("" , "A B C" , "D" , "B D" )

#  Unique column headers (excluding empty strings as in example)
y <- unique( unlist( strsplit( str , " " ) ) )

#  Results
out <- t( sapply( str , function(x) y %in% unlist( strsplit( x , " " ) ) , USE.NAMES = FALSE ) ) * 1L

#  Combine to a data.frame
setNames( data.frame( out ) , y )
#  A B C D
#1 0 0 0 0
#2 1 1 1 0
#3 0 0 0 1
#4 0 1 0 1
Simon O'Hanlon
  • 58,647
  • 14
  • 142
  • 184