strsplit into data.frame with incomplete input

Question

I am trying to split a vector of strings into a data.frame object. For a fixed order this isn't a problem (e.g. as written here), but in my particular case the strings do not all contain every column of the future data.frame. This is how the output should look for a toy input:

input <- c("an=1;bn=3;cn=45",
           "bn=3.5;cn=76",
           "an=2;dn=5")

res <- do.something(input)

> res
      an  bn  cn  dn
[1,]  1   3   45  NA
[2,]  NA  3.5 76  NA
[3,]  2   NA  NA  5

I am now looking for a function do.something that can do this in an efficient way. My naive solution at the moment would be to loop over the input objects, strsplit them on ; then strsplit them again on = and then fill the data.frame result by result. Is there a more R-like way to do that? I am afraid that doing it element by element would take quite a long time for a long input vector.

EDIT: Just for completeness, my naive solution looks like this:

  do.something <- function(x){
    temp <- strsplit(x,";")
    temp2 <- sapply(temp,strsplit,"=")
    ul.temp2 <- unlist(temp2)
    label <- sort(unique(ul.temp2[seq(1,length(ul.temp2),2)]))
    res <- data.frame(matrix(NA, nrow = length(x), ncol = length(label)))
    colnames(res) <- label
    for(i in 1:length(temp)){
      for(j in 1:length(label)){
        curInfo <- unlist(temp2[[i]])
        if(sum(is.element(curInfo,label[j]))>0){
          res[i,j] <- curInfo[which(curInfo==label[j])+1]
        }
      }
    }
    res
  }

EDIT2: Unfortunately my large input data looks like this (entries without '=' possible):

input <- c("an=1;bn=3;cn=45",
           "an;bn=3.5;cn=76",
           "an=2;dn=5")

so I cannot compare the given answers to my problem at hand. My naive solution for that is

do.something <- function(x){
    temp <- strsplit(x,";")
    tempNames <- sort(unique(sapply(strsplit(unlist(temp),"="),"[",1)))
    res <- data.frame(matrix(NA, nrow = length(x), ncol = length(tempNames)))
    colnames(res) <- tempNames

    for(i in 1:length(temp)){
      curSplit <- strsplit(unlist(temp[[i]]),"=")
      curNames <- sapply(curSplit,"[",1)
      curValues <- sapply(curSplit,"[",2)
      for(j in 1:length(tempNames)){
        if(is.element(colnames(res)[j],curNames)){
          res[i,j] <- curValues[curNames==colnames(res)[j]]
        }
      }
    }
    res
  }

Okay, sorry, that was misleading. No, they aren't. They can be anything between 2 and 10 characters. — Daniel Fischer, Nov 12 '13 at 11:54
I edited my solution. It now uses just base package and should handle missing numbers efficiently. — Simon O'Hanlon, Nov 12 '13 at 15:43
The solutions using `rbind.fill` are all good, but they'll be *terribly slow*. [Check this post for fast solutions](http://stackoverflow.com/questions/17308551/do-callrbind-list-for-uneven-number-of-column/17309310#17309310). I think that's what you're looking for. I've not checked Simon's answer yet, but I'm guessing that'll be better than the `rbind.fill` based solutions. — Arun, Nov 13 '13 at 12:25
Thanks a lot, I'll try it out. As you wrote, the plyr solutions are still pretty slow (although they are way faster than the naive solution). It takes about 5 hours to import my data that way (with 150k rows and 15 expected columns). — Daniel Fischer, Nov 14 '13 at 07:05

kohske · Answer 1 · 2013-11-12T13:37:29.503

score 4

This is somewhat bad technique, but sometimes ept (eval(parse(text = ...))) is useful.

> library(plyr)
> rbind.fill(lapply(input, function(x) {l <- new.env(); eval(parse(text = x), envir=l); as.data.frame(as.list(l))}))
  an cn  bn dn
1  1 45 3.0 NA
2 NA 76 3.5 NA
3  2 NA  NA  5
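
To make it clearer why this works, here is a minimal sketch for a single string: each input element happens to be valid R code, namely assignments separated by ;, so evaluating it inside a fresh environment creates one variable per key. The names x, e and vals are only illustrative. As far as I can tell, a bare key without a value (such as an in EDIT2) would be evaluated as a variable lookup and fail, so treat this as a sketch for the original input only.

```r
# One input element; it parses as three assignments separated by ';'
x <- "an=1;bn=3;cn=45"

# Evaluate the assignments in a fresh environment so nothing leaks
# into the global workspace
e <- new.env()
eval(parse(text = x), envir = e)

# Collect the variables that were created, in a fixed order
vals <- mget(c("an", "bn", "cn"), envir = e)
str(vals)
```

Keep in mind that eval(parse()) runs whatever the strings contain, so it is only reasonable for input you trust.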

Update

> z <- lapply(strsplit(input, ";"), 
+             function(x) {
+               e <- Filter(function(y) length(y)==2, strsplit(x, "="))
+               r <- data.frame(lapply(e, `[`, 2))
+               names(r) <- lapply(e, `[`, 1)
+               r
+             })
> rbind.fill(z)
    an   bn   cn   dn
1    1    3   45 <NA>
2 <NA>  3.5   76 <NA>
3    2 <NA> <NA>    5
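
The <NA> entries in this output suggest the columns are character or factor rather than numeric. If numeric columns are wanted, a conversion step along the following lines might help. This is a sketch on a small hand-built data.frame (res here is a stand-in, not the real result), and it assumes every value is a number.

```r
# Stand-in for a result whose columns came back as character
res <- data.frame(an = c("1", NA, "2"),
                  bn = c("3", "3.5", NA),
                  stringsAsFactors = FALSE)

# Going through as.character() first also makes this safe for factor columns;
# res[] keeps the data.frame structure while replacing every column
res[] <- lapply(res, function(col) as.numeric(as.character(col)))
str(res)
```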

edited Nov 12 '13 at 13:37

answered Nov 12 '13 at 12:03


Thanks, this looks way more concise than my solution. Unfortunately I cannot compare the timings between the solutions, because my input looks in fact slightly different than I thought, hence this solution doesn't work on it (See the EDIT2). But still, as this solution solved the initial problem I'll accept it. – Daniel Fischer Nov 12 '13 at 13:15
Great, thanks! Compared to my above given naive solution this is about 9 times faster! – Daniel Fischer Nov 12 '13 at 13:58

Simon O'Hanlon · Accepted Answer · 2013-11-12T15:38:58.567

score 4

Here's another way which should work even given your edited data. Extract the column names and values from your input vector using regmatches, then run through each list element matching the values to the appropriate column names.

#  Get column names
tag <- regmatches( input , gregexpr( "[a-z]+" , input ) )

#  Get numbers including floating point, replace missing values with NA
val <- regmatches( input , gregexpr( "\\d+\\.?\\d*|(?<=[a-z]);" , input , perl = TRUE ) )
val <- lapply( val , gsub , pattern = ";" , replacement = NA )

#  Column names
nms <- unique( unlist(tag) )

#  Intermediate matrices (SIMPLIFY = FALSE keeps a list even if all rows have the same length)
ll <- mapply( cbind , val , tag , SIMPLIFY = FALSE )

#  Match to appropriate columns and coerce to data.frame
out <- data.frame( do.call( rbind , lapply( ll , function(x) x[ match( nms , x[,2] ) ]  ) ) )
names(out) <- nms
#    an   bn   cn   dn
#1    1    3   45 <NA>
#2 <NA>  3.5   76 <NA>
#3    2 <NA> <NA>    5
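
The same idea of matching values to their column names can also be sketched without regular expressions for the values, by flattening all key=value pieces and filling a matrix in one step with matrix indexing. This is a variation and not the code above. The names pieces, row_id, key, value and cols are illustrative, it assumes all values are numeric, and it needs lengths() (R 3.2.0 or later).

```r
input <- c("an=1;bn=3;cn=45",
           "an;bn=3.5;cn=76",
           "an=2;dn=5")

# Split every string into its "key=value" pieces and remember the source row
pieces <- strsplit(input, ";", fixed = TRUE)
row_id <- rep(seq_along(pieces), lengths(pieces))
flat   <- unlist(pieces)

# Separate keys from values; a piece without '=' gets an NA value
has_value <- grepl("=", flat, fixed = TRUE)
key   <- sub("=.*$", "", flat)
value <- ifelse(has_value, sub("^[^=]*=", "", flat), NA_character_)

# Fill a numeric matrix in one vectorised assignment:
# each (row, column) pair in the index matrix receives one value
cols <- sort(unique(key))
res  <- matrix(NA_real_, nrow = length(input), ncol = length(cols),
               dimnames = list(NULL, cols))
res[cbind(row_id, match(key, cols))] <- as.numeric(value)
res
```

Because there is no per-row loop and no per-row data.frame, this should scale reasonably with long input vectors, though I have not timed it on data of the size mentioned in the question.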

edited Nov 12 '13 at 15:38

answered Nov 12 '13 at 12:17


Thanks for a 'base' solution! I'll try it out and compare the timings. – Daniel Fischer Nov 14 '13 at 07:06
I just tried it and it seems very much faster! I still have to adjust some small things about the regmatches, but still this finishes in minutes instead of hours. – Daniel Fischer Nov 15 '13 at 06:35
@DanielFischer I am very glad this was useful for you! Great stuff. cheers :-) – Simon O'Hanlon Nov 15 '13 at 09:18

score 2 · Answer 3 · answered Nov 12 '13 at 11:59

Not really efficient, and it uses an external package.

Convert each line to a data.frame.
rbind them using rbind.fill from plyr.

Here is my code:

ll <- lapply(input,function(x){
        xx <- unlist(strsplit(x,";"))
        nn <- sub('([a-z]+)[=].*','\\1',xx)
        vv <- sub('([a-z]+)[=]([0-9]+([.][0-9]+)?)','\\2',xx)
        m <- t(data.frame(vv))
        colnames(m) <- nn
        as.data.frame(m)
})

library(plyr)
rbind.fill(ll)
    an   bn   cn   dn
1    1    3   45 <NA>
2 <NA>  3.5   76 <NA>
3    2 <NA> <NA>    5

Thanks for this solution, unfortunately I cannot accept all answers but at least +1. — Daniel Fischer, Nov 12 '13 at 13:16

Sam Dickson · Answer 4 · 2013-11-12T12:58:07.907

score 1

One more variation on the rbind.fill theme:

library(plyr)

mini.df <- function(x) {
  y <- do.call(cbind,strsplit(x,"="))
  z <- as.numeric(y[2,])
  names(z) <- y[1,]
  return(as.data.frame(t(z)))
}
res <- rbind.fill(lapply(strsplit(input,";"),mini.df))

This is actually very similar to the other two solutions. I just created the dataframes slightly differently.

edited Nov 12 '13 at 12:58

answered Nov 12 '13 at 12:28


Thanks for this solution, unfortunately I cannot accept all answers but at least +1. – Daniel Fischer Nov 12 '13 at 13:20
