
I am trying to split a vector of strings into a data.frame. For a fixed column order this isn't a problem (e.g. as written here), but in my particular case not every string contains all of the columns of the future data frame. This is what the output should look like for a toy input:

input <- c("an=1;bn=3;cn=45",
           "bn=3.5;cn=76",
           "an=2;dn=5")

res <- do.something(input)

> res
      an  bn  cn  dn
[1,]  1   3   45  NA
[2,]  NA  3.5 76  NA
[3,]  2   NA  NA  5

I am now looking for a function do.something that can do this in an efficient way. My naive solution at the moment would be to loop over the input elements, strsplit each one on ";", strsplit the pieces again on "=", and then fill the data.frame result by result. Is there a more R-like way to do that? I am afraid that going element by element would take quite a long time for a long input vector.

EDIT: Just for completeness, my naive solution looks like this:

  do.something <- function(x){
    temp <- strsplit(x,";")
    temp2 <- sapply(temp,strsplit,"=")
    ul.temp2 <- unlist(temp2)
    label <- sort(unique(ul.temp2[seq(1,length(ul.temp2),2)]))
    res <- data.frame(matrix(NA, nrow = length(x), ncol = length(label)))
    colnames(res) <- label
    for(i in 1:length(temp)){
      for(j in 1:length(label)){
        curInfo <- unlist(temp2[[i]])
        if(sum(is.element(curInfo,label[j]))>0){
          res[i,j] <- curInfo[which(curInfo==label[j])+1]
        }
      }
    }
    res
  }

EDIT2: Unfortunately my large input data looks like this (entries without '=' possible):

input <- c("an=1;bn=3;cn=45",
           "an;bn=3.5;cn=76",
           "an=2;dn=5")

so I cannot compare the given answers on my actual problem. My naive solution for this input is:

do.something <- function(x){
    temp <- strsplit(x,";")
    tempNames <- sort(unique(sapply(strsplit(unlist(temp),"="),"[",1)))
    res <- data.frame(matrix(NA, nrow = length(x), ncol = length(tempNames)))
    colnames(res) <- tempNames

    for(i in 1:length(temp)){
      curSplit <- strsplit(unlist(temp[[i]]),"=")
      curNames <- sapply(curSplit,"[",1)
      curValues <- sapply(curSplit,"[",2)
      for(j in 1:length(tempNames)){
        if(is.element(colnames(res)[j],curNames)){
          res[i,j] <- curValues[curNames==colnames(res)[j]]
        }
      }
    }
    res
  }
Daniel Fischer
  • are your column names always two characters long? – Simon O'Hanlon Nov 12 '13 at 11:53
  • Okay, sorry, that was misleading. No, they aren't. They can be anything between 2 and 10 characters long. – Daniel Fischer Nov 12 '13 at 11:54
  • I edited my solution. It now uses just base package and should handle missing numbers efficiently. – Simon O'Hanlon Nov 12 '13 at 15:43
  • The solutions using `rbind.fill` are all good, but they'll be *terribly slow*. [Check this post for fast solutions](http://stackoverflow.com/questions/17308551/do-callrbind-list-for-uneven-number-of-column/17309310#17309310). I think that's what you're looking for. I've not checked Simon's answer yet, but I'm guessing that'll be better than the `rbind.fill` based solutions. – Arun Nov 13 '13 at 12:25
  • Thanks a lot, I'll try it out. As you wrote, the plyr solutions are still pretty slow (although they are way faster than the naive solution). It takes about 5 hours to import my data that way (with 150k rows and 15 expected columns). – Daniel Fischer Nov 14 '13 at 07:05

4 Answers


This is somewhat bad practice, but sometimes eval(parse(text = ...)) is useful.

> library(plyr)
> rbind.fill(lapply(input, function(x) {l <- new.env(); eval(parse(text = x), envir=l); as.data.frame(as.list(l))}))
  an cn  bn dn
1  1 45 3.0 NA
2 NA 76 3.5 NA
3  2 NA  NA  5

Update

> z <- lapply(strsplit(input, ";"), 
+             function(x) {
+               e <- Filter(function(y) length(y)==2, strsplit(x, "="))
+               r <- data.frame(lapply(e, `[`, 2))
+               names(r) <- lapply(e, `[`, 1)
+               r
+             })
> rbind.fill(z)
    an   bn   cn   dn
1    1    3   45 <NA>
2 <NA>  3.5   76 <NA>
3    2 <NA> <NA>    5
kohske
  • Thanks, this looks way more concise than my solution. Unfortunately I cannot compare the timings between the solutions, because my input looks in fact slightly different than I thought, hence this solution doesn't work on it (See the EDIT2). But still, as this solution solved the initial problem I'll accept it. – Daniel Fischer Nov 12 '13 at 13:15
  • Great, thanks! Compared to my above given naive solution this is about 9 times faster! – Daniel Fischer Nov 12 '13 at 13:58

Here's another way which should work even given your edited data. Extract the column names and values from your input vector using regmatches, then run through each list element matching the values to the appropriate column names.

#  Get column names
tag <- regmatches( input , gregexpr( "[a-z]+" , input ) )

#  Get numbers including floating point (any number of decimal digits), replace missing values with NA
val <- regmatches( input , gregexpr( "\\d+\\.?\\d*|(?<=[a-z]);" , input , perl = TRUE ) )
val <- lapply( val , gsub , pattern = ";" , replacement = NA )

#  Column names
nms <- unique( unlist(tag) )

#  Intermediate matrices (SIMPLIFY = FALSE keeps a list even if every row has the same number of fields)
ll <- mapply( cbind , val , tag , SIMPLIFY = FALSE )

#  Match to appropriate columns and coerce to data.frame
out <- data.frame( do.call( rbind , lapply( ll , function(x) x[ match( nms , x[,2] ) ]  ) ) )
names(out) <- nms
#    an   bn   cn   dn
#1    1    3   45 <NA>
#2 <NA>  3.5   76 <NA>
#3    2 <NA> <NA>    5
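
The same idea of matching values to column names can also be sketched without regular expressions for the values, by filling a pre-allocated matrix in one step with matrix indexing. This is only a minimal sketch under some assumptions: `parse_pairs` is a hypothetical helper name, all values are assumed to be numeric, the key is everything before the first "=", and if a key occurs twice in one string the last value wins. It has not been timed on large data.

```r
input <- c("an=1;bn=3;cn=45",
           "an;bn=3.5;cn=76",
           "an=2;dn=5")

# Hypothetical helper: parse "key=value" fields into a numeric data.frame
parse_pairs <- function(x) {
  fields <- strsplit(x, ";", fixed = TRUE)
  row    <- rep(seq_along(fields), lengths(fields))  # source row of each field
  flat   <- unlist(fields)
  has_eq <- grepl("=", flat, fixed = TRUE)           # FALSE for entries like "an"
  key    <- sub("=.*$", "", flat)                    # text before the first "="
  value  <- ifelse(has_eq, sub("^[^=]*=", "", flat), NA_character_)
  cols   <- sort(unique(key))

  # Pre-allocate an all-NA matrix and fill every cell in a single assignment
  m <- matrix(NA_character_, nrow = length(x), ncol = length(cols),
              dimnames = list(NULL, cols))
  m[cbind(row, match(key, cols))] <- value

  out <- as.data.frame(m, stringsAsFactors = FALSE)
  out[] <- lapply(out, as.numeric)                   # assumption: values are numeric
  out
}

res <- parse_pairs(input)
```

Because nothing is bound row by row, the work is done by a handful of vectorised calls over the flattened fields rather than one data.frame per input string.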
Simon O'Hanlon

Not really efficient, and it uses an external package.

  1. convert each line to a data.frame
  2. rbind them using rbind.fill from plyr

Here is my code:

ll <- lapply(input,function(x){
        xx <- unlist(strsplit(x,";"))
        nn <- sub('([a-z]+)[=].*','\\1',xx)
        vv <- sub('([a-z]+)[=]([0-9]+([.][0-9]+)?)','\\2',xx)
        m <- t(data.frame(vv))
        colnames(m) <- nn
        as.data.frame(m)
})

library(plyr)
rbind.fill(ll)

rbind.fill(ll)
    an   bn   cn   dn
1    1    3   45 <NA>
2 <NA>  3.5   76 <NA>
3    2 <NA> <NA>    5
agstudy

One more variation on the rbind.fill theme:

library(plyr)

mini.df <- function(x) {
  y <- do.call(cbind,strsplit(x,"="))
  z <- as.numeric(y[2,])
  names(z) <- y[1,]
  return(as.data.frame(t(z)))
}
res <- rbind.fill(lapply(strsplit(input,";"),mini.df))

This is actually very similar to the other two solutions. I just created the dataframes slightly differently.
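
For input like the one in EDIT2, where an entry may have no "=", the per-row idea can be sketched in base R alone by building one named numeric vector per string and aligning the vectors by name instead of calling rbind.fill. This is a rough sketch, not a tuned implementation: `mini.vec` is a hypothetical helper name and all values are assumed to be numeric.

```r
input <- c("an=1;bn=3;cn=45",
           "an;bn=3.5;cn=76",
           "an=2;dn=5")

# Hypothetical helper: one named numeric vector per input string
mini.vec <- function(x) {
  parts <- strsplit(x, "=", fixed = TRUE)
  # Entries without "=" have no second piece and become NA
  vals <- vapply(parts,
                 function(p) if (length(p) >= 2) p[2] else NA_character_,
                 character(1))
  setNames(as.numeric(vals), vapply(parts, `[`, character(1), 1))
}

rows <- lapply(strsplit(input, ";", fixed = TRUE), mini.vec)
cols <- sort(unique(unlist(lapply(rows, names))))

# Indexing by name yields NA for columns a row does not have
res2 <- as.data.frame(do.call(rbind, lapply(rows, function(r) unname(r[cols]))))
names(res2) <- cols
```

Indexing each named vector with the full set of column names pads the missing columns with NA, which is the part rbind.fill handles in the plyr-based versions.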

Sam Dickson