
In R, what is the fastest way to convert character strings of space-separated numbers (a character vector, one string per row) into a single numeric vector?

With the following dummy data:

set.seed(2)
N = 1e7
ncol = 10
myT = formatC(matrix(runif(N), ncol = ncol)) # A matrix converted to characters
# Each row is collapsed into a single space-separated string:
myT = apply(myT, 1, function(x) paste(x, collapse=' ') ) 
head(myT)

Producing:

[1] "0.1849 0.855 0.8272 0.5403 0.3891 0.5184 0.7776 0.5533 0.1566 0.01591"  
[2] "0.7024 0.1008 0.9442 0.8582 0.3184 0.9289 0.9957 0.1311 0.2131 0.07355" 
[3] "0.5733 0.5493 0.3915 0.4423 0.8522 0.6042 0.9265 0.006878 0.7052 0.71"   
[... etc ...] 

I could do

library(stringi) 
# In the actual dataset, the number of spaces between numbers may vary, hence "\\s+"
system.time(newT <- lapply(stri_split_regex(myT, "\\s+", omit_empty=T), as.numeric)) 
newT <- unlist(newT) # Final goal is to have a single vector of numbers

On my 64-bit Intel Core i7 2.10 GHz machine with 16 GB of RAM (running Ubuntu):

   user  system elapsed 
  3.748   0.008   3.757 

With the real dataset (ncol=150 and N~1e9), this is way too long. Any better option?

ztl
  • What is "way too long" for you? What times are you getting and how fast is your system? – Mike Wise Oct 02 '15 at 08:50
  • I added info on my system and the times I get. "Way too long" means that, if I did this with the real dataset, it would take many hours, which is not an option since it has to be done many times. Hence (and quite independently of the times I get), I am simply looking for the fastest way to achieve this, in order to see whether it is feasible. – ztl Oct 02 '15 at 09:18
  • I wonder how you got `myT`. Maybe you need to change a prior step. – Roland Oct 02 '15 at 09:28
  • Even if there is more than one space between each number, you'd do better using `stri_split_fixed`. `as.numeric` is unaffected by the leading or trailing whitespace. – A5C1D2H2I1M1N2O1R2T1 Oct 02 '15 at 10:14
  • Indeed @Roland, you spotted a problem I encounter at a prior step (see here: http://stackoverflow.com/questions/32885570/fast-reading-by-chunk-and-processing-of-a-file-with-dummy-lines-at-regular-in). I obtain that character vector from `readLines`. – ztl Oct 07 '15 at 12:23
  • If your input file is as regular as you show there, you should probably pre-process it with sed or awk or some other fast command-line tool (to remove the lines you don't want) and then read it with `fread`. – Roland Oct 07 '15 at 12:44
  • Thanks for the suggestion, @Roland. I don't know these tools, I'll have a look. In case you have an easy suggestion for that original problem, don't hesitate to answer there ;-) – ztl Oct 07 '15 at 12:51
  • http://stackoverflow.com/questions/9894986/how-can-i-delete-every-xth-line-in-a-text-file – Roland Oct 07 '15 at 12:57
  • @Roland Yes, and here http://stackoverflow.com/questions/5410757/delete-a-line-containing-a-specific-string-using-sed to remove on the basis of a pattern. Thanks - SO MUCH easier and faster (and smarter) than the `R`-only solution I was looking for. If you propose this as answer to my original question, I'd accept it (could be useful for posterity...) – ztl Oct 07 '15 at 13:20
  • Feel free to post an answer yourself. Note that `fread` accepts a shell command that preprocesses the file as input [a sketch of this follows below]. – Roland Oct 07 '15 at 13:54
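
A minimal sketch of the preprocessing Roland describes, assuming GNU awk/sed are on the path; the file name `data.txt`, the "drop every 11th line" rule and the `#`-comment pattern are placeholders, not details taken from the question:

library(data.table)

# Drop every 11th (dummy) line with awk, then let fread() parse the rest.
# Recent data.table versions take the shell command via the `cmd` argument.
DT <- fread(cmd = "awk 'NR % 11 != 0' data.txt")

# Or delete lines matching a pattern with sed before reading:
# DT <- fread(cmd = "sed '/^#/d' data.txt")

newT <- c(t(DT))  # flatten row-wise into a single numeric vector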

3 Answers


This is twice as fast on my system:

x <- paste(myT, collapse = "\n")  # join the vector into one newline-separated block
library(data.table)
DT <- fread(x)                    # fread() parses the in-memory text into a data.table
newT2 <- c(t(DT))                 # transpose and flatten to keep the original row-wise order
Roland
  • Thanks @Roland, I get the same type of improvement and this is the fastest solution so far. Simple and elegant, too - thanks! – ztl Oct 07 '15 at 08:54

I would suggest the "iotools" package, specifically the `mstrsplit` function. With that you would just do:

library(iotools)
newT <- as.vector(t(mstrsplit(myT, sep = " ", ncol = 10, type = "numeric")))  # parse to a numeric matrix, then flatten row-wise

Get the "iotools" package on GitHub.


Timing comparisons:

OPFun <- function(myT) {
  newT <- lapply(stri_split_regex(myT, "\\s+", omit_empty=T), as.numeric)
  unlist(newT)
}

RolandFun <- function(myT) {
  x <- paste(myT, collapse = "\n")
  DT <- fread(x)
  newT2 <- c(t(DT))
  newT2
}

AMFun <- function(myT) {
  as.vector(t(mstrsplit(myT, sep = " ", ncol = 10, type = "numeric")))
}

system.time(OP <- OPFun(myT))
#    user  system elapsed 
#   3.920   0.004   3.917 
system.time(Roland <- RolandFun(myT))
#    user  system elapsed 
#   3.156   0.020   3.175 
system.time(AM <- AMFun(myT))
#    user  system elapsed 
#   0.664   0.016   0.676 

all.equal(OP, Roland)
# [1] TRUE
all.equal(Roland, AM)
# [1] TRUE
A5C1D2H2I1M1N2O1R2T1
  • @ztl, I don't think it will matter. The multiple spaces would essentially collapse. – A5C1D2H2I1M1N2O1R2T1 Oct 02 '15 at 12:29
  • Thanks @Ananda Mahto, this looks very promising and I'd like to test it on my real data, but... silly question: what do I do if `sep` should be more than one space in `mstrsplit`? It looks like it must be a single-character value and I can't find a solution immediately...?! – ztl Oct 02 '15 at 12:29
  • Unless I'm doing something wrong, the multiple spaces do matter in my real situation, as they affect `ncol` and produce NAs. The workaround I found is to omit `sep` from the arguments of `mstrsplit` and to do `newT <- newT[!is.na(newT)]` afterwards [see also the alternative sketch after these comments]. This is clearly faster than my solution, thanks! – ztl Oct 02 '15 at 13:13
  • @ztl, So, is the problem solved? Let me know. Thanks. – A5C1D2H2I1M1N2O1R2T1 Oct 02 '15 at 13:17
  • Yes, I can implement your suggestion as mentioned in my comment, thanks! This is an improvement. I am not accepting your answer yet, as I need to perform further tests to compare with other possibilities (and maybe another, faster one will pop up?). But yours is elegant and efficient, thanks! – ztl Oct 02 '15 at 15:51
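
A hedged sketch of an alternative to the NA-filtering workaround mentioned above, assuming the only irregularity in the real data is variable runs of spaces: normalize the whitespace with stringi first, then keep a single-space `sep` in `mstrsplit`. This is not part of the original answer:

library(stringi)
library(iotools)

# Collapse runs of whitespace to single spaces (and trim the ends) so that
# mstrsplit()'s single-character separator matches the data again.
myT_clean <- stri_trim_both(stri_replace_all_regex(myT, "\\s+", " "))
# ncol = 10 matches the dummy data above; the real data would use ncol = 150.
newT <- as.vector(t(mstrsplit(myT_clean, sep = " ", ncol = 10, type = "numeric")))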

`mstrsplit(myT, sep = " ", type = "numeric")[, 1]` is marginally faster. Note that the order of operations influences performance: `unlist(lapply(x, as.numeric))` is slower than `as.numeric(unlist(x))`.

set.seed(2)
N = 1e4
ncol = 10
myT = formatC(matrix(runif(N), ncol = ncol)) # A matrix converted to characters
myT = apply(myT, 1, function(x) paste(x, collapse=' ') ) 
head(myT)

library(microbenchmark)
library(stringi) 
library(data.table)
library(iotools)
microbenchmark(
  original = {
    newT <- lapply(stri_split_regex(myT, "\\s+", omit_empty=T), as.numeric)
    unlist(newT)
  },
  data.table = {
    x <- paste(myT, collapse = "\n")
    DT <- fread(x)
    c(t(DT))
  },
  iotools = {
    as.vector(t(mstrsplit(myT, sep = " ", ncol = 10, type = "numeric")))
  },
  strsplit = {
    as.numeric(unlist(strsplit(myT, " ")))
  },
  original2 = {
     as.numeric(unlist(stri_split_regex(myT, "\\s+", omit_empty = TRUE)))
  },
  iotools2 = {
    mstrsplit(myT, sep = " ", type = "numeric")[, 1]
  }
)
Unit: milliseconds
       expr      min       lq     mean   median       uq       max neval   cld
   original 52.03538 53.56949 56.02025 54.27165 55.40487  94.45513   100   c  
 data.table 93.10810 94.63730 98.04845 95.41537 96.51202 212.66666   100     e
    iotools 18.73776 19.44485 21.00974 19.75573 20.05614  42.47620   100 a    
   strsplit 67.04637 69.24053 70.58916 69.86529 70.95980  84.86132   100    d 
  original2 48.25558 49.47346 51.49833 50.14377 50.96139  84.22928   100  b   
   iotools2 18.53165 19.19126 19.72922 19.52567 19.71340  32.48726   100 a    
Thierry