I have the following vector:

my.vector = c("4M1D5M15I1D10M", "3M", "4M2I3D")

And I'd like to transform it into the following vector:

my.result = c("21N", "3N", "7N")

The logic is as follows. For "4M1D5M15I1D10M" I add all the numbers except those that precede an "I" character, i.e., 4+1+5+1+10=21 (15 is skipped because it precedes an "I"), and then append an "N", giving "21N".

The same applies to "3M": there is no "I" character, so it simply becomes "3N". For the last one, 4+3=7 (2 is skipped because it precedes an "I"), giving "7N".

Note that my.vector is extremely large, so I want to use the parallel capabilities of the HPC server via mclapply. Ideally I'd run something like this to get my result:

my.result = unlist(mclapply(my.vector, my.adding.function, mc.cores = ncores))

To define my function, I tried the following:

my.adding.function <- function(x)
{
   tmp = unlist(strsplit(x, "\\d+I"))
   tmp2 = unlist(strsplit(tmp, "M|D|S|N"))
   tmp3 = sum(as.numeric(tmp2))
   return(paste(tmp3, "N",sep=""))
}

I'm not sure about the efficiency of this function, though...
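One rough way to gauge that is to time the function on a larger synthetic input. The sketch below repeats the function from above so that it runs on its own; the name `big` and the repetition count of 10000 are arbitrary choices for illustration, not the real data.

```r
# Same function as above, repeated so the example is self-contained.
my.adding.function <- function(x)
{
   tmp = unlist(strsplit(x, "\\d+I"))
   tmp2 = unlist(strsplit(tmp, "M|D|S|N"))
   tmp3 = sum(as.numeric(tmp2))
   return(paste(tmp3, "N", sep=""))
}

my.vector = c("4M1D5M15I1D10M", "3M", "4M2I3D")

# Synthetic input: the three sample strings repeated (arbitrary size).
big = rep(my.vector, 10000)

# Serial baseline; vapply checks that each call returns exactly one string.
timing = system.time(
   out <- vapply(big, my.adding.function, character(1), USE.NAMES = FALSE)
)
print(timing["elapsed"])
```

The elapsed time depends on the machine, so it is only useful as a baseline to compare against a parallel or vectorized version on the same hardware.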

Dnaiel
  • @Gsee, good point, putting it up there. – Dnaiel Sep 28 '13 at 19:06
  • @Gsee, basically, I am trying to learn how to work in R so that operations do not take an extremely long time on huge data. I have some code implemented, but when I run it, it takes forever and never finishes, so I am trying to optimize each step, hence the mclapply, etc. I can write my own functions that do most of what I want, but they end up being quite slow. – Dnaiel Sep 28 '13 at 19:13
  • It's unlikely that the best solution is to use `mclapply` for every little operation in your code, because there is overhead associated with collecting the results. You're probably better off vectorizing as much as you can and using `mclapply` on bigger chunks of logic, but I'm just guessing since you're only showing us one tiny piece of your project at a time. – GSee Sep 28 '13 at 19:20
  • @Gsee, that makes sense, this is good advice. I am now printing timestamps after every operation, which will give me a good idea of the bottlenecks. As a side note, mclapply's overhead is much lower than that of foreach and parLapply for small operations. – Dnaiel Sep 28 '13 at 19:24
  • @Gsee, I'd gladly post my whole code for optimization, but I am trying to be mindful. It'd be nice if there were a code review session :-) just wishful thinking though... – Dnaiel Sep 28 '13 at 19:26
  • You need to profile your code. See http://stackoverflow.com/a/2075404/967840, http://www.stat.berkeley.edu/~nolan/stat133/Fall05/lectures/profilingEx.html, or http://stackoverflow.com/questions/3650862/how-to-efficiently-use-rprof-in-r – GSee Sep 28 '13 at 19:30
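The suggestion above about using `mclapply` on bigger chunks could look roughly like the sketch below: the input is split into one contiguous chunk per core and each worker loops over its whole chunk, so results are collected once per chunk rather than once per string. The names `process.chunk`, `big`, `chunk.size`, and `chunk.id` are hypothetical, the core count of 2 is an arbitrary choice, and `my.adding.function` is copied from the question. Note that `mclapply` relies on forking, so on Windows it only runs with `mc.cores = 1`.

```r
library(parallel)

# Function from the question, repeated so the sketch is self-contained.
my.adding.function <- function(x) {
  tmp  <- unlist(strsplit(x, "\\d+I"))
  tmp2 <- unlist(strsplit(tmp, "M|D|S|N"))
  paste0(sum(as.numeric(tmp2)), "N")
}

# Hypothetical worker: processes a whole chunk of strings in one task.
process.chunk <- function(chunk) {
  vapply(chunk, my.adding.function, character(1), USE.NAMES = FALSE)
}

my.vector <- c("4M1D5M15I1D10M", "3M", "4M2I3D")
big <- rep(my.vector, 1000)   # synthetic stand-in for the large vector

# Forking is unavailable on Windows, so fall back to a single core there.
ncores <- if (.Platform$OS.type == "windows") 1L else 2L

# Assign each element to one of ncores contiguous chunks.
chunk.size <- ceiling(length(big) / ncores)
chunk.id   <- ceiling(seq_along(big) / chunk.size)
chunks     <- split(big, chunk.id)

# One task per chunk; contiguous chunks keep the original order.
my.result <- unlist(mclapply(chunks, process.chunk, mc.cores = ncores),
                    use.names = FALSE)
```

Whether this beats a plain serial loop depends on the input size and the number of cores, so it is worth timing both on the real data.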

1 Answer

Here is one solution without mclapply; please check whether it is feasible:

L <- regmatches(my.vector, gregexpr("(\\d+)(?=[A-HJ-Z])", my.vector, perl=TRUE))
sapply(L, function(x)paste0(sum(as.numeric(x)),"N"))
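
As a quick check, the same pattern can be run on the sample input and compared with the expected result from the question. The lookahead `(?=[A-HJ-Z])` keeps only the numbers that are followed by an uppercase letter other than "I". This sketch uses `vapply` instead of `sapply` so that anything other than one string per element raises an error; the name `res` is just for illustration.

```r
my.vector <- c("4M1D5M15I1D10M", "3M", "4M2I3D")
my.result <- c("21N", "3N", "7N")

# Extract every run of digits that is followed by a letter other than "I".
L <- regmatches(my.vector, gregexpr("\\d+(?=[A-HJ-Z])", my.vector, perl = TRUE))

# Sum the numbers for each string and append "N".
res <- vapply(L, function(x) paste0(sum(as.numeric(x)), "N"), character(1))

# Stops with an error if the output differs from the expected vector.
stopifnot(identical(res, my.result))
```

Because `gregexpr` and `regmatches` work on the whole vector at once, only the final summing step loops over elements.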
Ferdinand.kraft