I have a table cluster (with more than one column):

head(cluster[,c('cuil_direccion')])
[1] "PJE INDEA 98 5                    "
[2] "PJE INDE 98 5                    "
[3] "B 34 VIV RECRE 57 00                 "
[4] "S CASA DE GO 600                  "
[5] "RCCA 958 00 o                             "
[6] "JUAN B  1900                       "

I need to run a function that, for each row, extracts the numbers and pastes them together as a comma-separated list. I'm using str_extract_all. Since the table is huge, I'd like to split the data and process each split on a different core. I tried:

library(foreach)
library(doParallel)
registerDoParallel(cores=detectCores(all.tests=TRUE))

crea_tabla <- function(x){
  xlst <- split(x, 1:nrow(x)) 
  pred <- foreach(i = xlst, .combine = rbind) %dopar% {
    library(stringr)
    d<-data.frame(dir='a', E_numdir=1)
    j=1  
    DIR<-i$cuil_direccion[j]
    E_NUMDIR <- str_extract_all(DIR,"\\(?[0-9]+\\)?")[[1]]
    d<-rbind(d, data.frame( dir=DIR , 
                         E_numdir=toString(E_NUMDIR)))
    j=1+j    
  }
}

then I ran

crea_tabla(cluster)

And I get an empty result.

I'm not sure how doParallel handles the data inside the loop. For example, this part:

 library(stringr)
    d<-data.frame(dir='a', E_numdir=1)
    j=1  

Should I write it before or after %dopar%?

EDIT

num_cores<-detectCores(all.tests=TRUE)
registerDoParallel(cores=detectCores(all.tests=TRUE))



crea_tabla <- function(x, num_cores){
  xlst <- split(x, 1:nrow(x)) 
  j=1 
  d<-data.frame(dir='a', E_numdir=1) 
  pred <- foreach(i = seq_along(xlst), .combine = rbind) %dopar% {
  print(i*num_cores/nrow(x))
    library(stringr)
    DIR<-xlst[[i]]$cuil_direccion
    E_NUMDIR <- str_extract_all(DIR,"\\(?[0-9]+\\)?")[[1]]
    data.frame(dir=DIR , E_numdir=toString(E_NUMDIR))    
  }
  d <- rbind(d, pred)
  return(d)
}

a<-crea_tabla(cluster, num_cores)
GabyLP

1 Answer

There are several things to note. First, you are right to be suspicious about where you initialize variables. Declare them before the loop, and load the library once rather than on every iteration. (With a socket-cluster backend the workers are separate R sessions, so also pass .packages = 'stringr' to foreach.) Second, you don't need the j variable. Just seq_along your list and index into it.

Next, regarding foreach: you have specified .combine = rbind, so you do not need to call rbind inside the loop. If you want that first row, just rbind the result of the foreach loop to the initial data.frame. The following accomplishes what it appears you are trying to do.

Lastly, I assume you realize this, but make sure you register a parallel backend; without one, %dopar% runs sequentially and issues a warning. I don't know which OS you are using, but you need a backend package such as doParallel, doMC or doSNOW.

# recreate your data
cluster <- read.table(header=F, text='
"PJE INDEA 98 5                    "
"PJE INDE 98 5                    "
"B 34 VIV RECRE 57 00                 "
"S CASA DE GO 600                  "
"RCCA 958 00 o                             "
"JUAN B  1900                       "
')
colnames(cluster) <- 'cuil_direccion'

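library(stringr)
library(foreach)
library(doParallel)

# register a backend; without one %dopar% falls back to sequential execution
registerDoParallel(cores = 2)  # 2 is an arbitrary choice

crea_tabla <- function(x){
    xlst <- split(x, 1:nrow(x))
    # placeholder first row, kept from the original code
    d <- data.frame(dir='a', E_numdir=1)
    # .packages loads stringr on every worker (required for cluster backends)
    pred <- foreach(i = seq_along(xlst), .combine = rbind, .packages = 'stringr') %dopar% {
        DIR <- xlst[[i]]$cuil_direccion
        E_NUMDIR <- str_extract_all(DIR,"\\(?[0-9]+\\)?")[[1]]
        # the last expression is what foreach passes to rbind
        data.frame(dir=DIR , E_numdir=toString(E_NUMDIR))
    }
    d <- rbind(d, pred)
    return(d)
}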

crea_tabla(cluster)

                                         dir   E_numdir
1                                          a          1
2         PJE INDEA 98 5                          98, 5
3          PJE INDE 98 5                          98, 5
4      B 34 VIV RECRE 57 00                  34, 57, 00
5         S CASA DE GO 600                          600
6 RCCA 958 00 o                                 958, 00
7        JUAN B  1900                              1900
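
Since the table is huge, one task per row may spend more time on scheduling than on the regex. A minimal sketch of a chunked variant follows, assuming one chunk per worker is a reasonable split. The helper name crea_tabla_chunks, the n_chunks argument and the demo data are hypothetical, and the placeholder first row is left out. It relies on str_extract_all being vectorized over its input.

```r
library(foreach)
library(doParallel)

# Hypothetical helper: one chunk of rows per task instead of one row per task.
crea_tabla_chunks <- function(x, n_chunks = 2) {
  # Assign each row to one of n_chunks roughly equal groups
  grp <- ceiling(seq_len(nrow(x)) * n_chunks / nrow(x))
  chunks <- split(x, grp)
  foreach(chunk = chunks, .combine = rbind, .packages = "stringr") %dopar% {
    dirs <- as.character(chunk$cuil_direccion)
    # str_extract_all is vectorized: one character vector of matches per row
    nums <- str_extract_all(dirs, "\\(?[0-9]+\\)?")
    data.frame(dir = dirs,
               E_numdir = vapply(nums, toString, character(1)),
               stringsAsFactors = FALSE)
  }
}

cl <- makeCluster(2)  # 2 workers is an arbitrary choice
registerDoParallel(cl)
demo <- data.frame(
  cuil_direccion = c("PJE INDEA 98 5", "B 34 VIV RECRE 57 00", "JUAN B  1900"),
  stringsAsFactors = FALSE
)
res <- crea_tabla_chunks(demo, n_chunks = 2)
stopCluster(cl)
res
```

Whether this is actually faster than the per-row version depends on the data size and the number of workers, so it is worth timing both on a sample first.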
cdeterman
  • Thanks! Can I also add a print to show what % is done at each moment? Something like print(i * #cores/nrow(x))? – GabyLP Jan 27 '15 at 18:22
  • You can add another argument and pass the #cores and use the `print` statement. Just make sure you don't put it at the end of the `foreach` or it will try and `rbind` that % together instead of your data.frames!!! – cdeterman Jan 27 '15 at 18:29
  • Thanks. But please see the edit; I still don't get the print output. – GabyLP Jan 27 '15 at 18:39
  • Ah yes, I forgot about the parallelization issue. This has been discussed several times on here. See [this](http://stackoverflow.com/questions/5423760/how-do-you-create-a-progress-bar-when-using-the-foreach-function-in-r) and [this](http://stackoverflow.com/questions/10903787/how-can-i-print-when-using-dopar) question. I am unaware of any general use solution that has been developed. I think you will be stuck with creating a log file if this is expected to take a long time. – cdeterman Jan 27 '15 at 18:56
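
The log-file idea from the last comment could look roughly like the sketch below. It assumes all workers run on the same machine and can write to the same file. The log_file path, the task count n and the squared value standing in for real work are placeholders, not part of the original code. Each task appends one line, and the progress line is not the last expression, so it is not passed to the combine function.

```r
library(foreach)
library(doParallel)

cl <- makeCluster(2)  # 2 workers is an arbitrary choice
registerDoParallel(cl)

log_file <- tempfile(fileext = ".log")  # hypothetical log location
n <- 6                                  # hypothetical number of tasks

res <- foreach(i = seq_len(n), .combine = c) %dopar% {
  # Append one progress line per finished task; order is not guaranteed
  cat(sprintf("task %d of %d done\n", i, n), file = log_file, append = TRUE)
  i^2  # stand-in for real work; the last expression is what gets combined
}
stopCluster(cl)

readLines(log_file)
```

While a long job runs, the file can be watched from another terminal (for example with tail -f on Unix) to see how many tasks have finished.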