I've got R code that works and does what I want, but it takes a very long time to run. Here is an explanation of what the code does, followed by the code itself.

I've got a vector of 200,000 lines containing street addresses (strings): data. Example:

> data[150000,]
                              address 
"15 rue andre lalande residence marguerite yourcenar 91000 evry france" 

And I have a 131x2 matrix of strings containing 5-grams (parts of words) and the ids of the bags of n-grams they belong to: list_ngrams. Example of a bag of 5-grams: ["stack", "tacko", "ackov", "ckove", "overf", ... ]

Example of list_ngrams :

  idSac ngram
1     4 stree
2     4 tree_ 
3     4 _stre
4     4 treet
5     5 avenu
6     5 _aven
7     5 venue
8     5 enue_

I also have a 200000x31 numeric matrix initialized with 0: idv_x_bags

In total I have 131 5-grams and 31 bags of 5-grams.

I want to loop over the address strings and check whether each one contains any of the n-grams in my list. If it does, I put a 1 in the column corresponding to the id of the bag that contains the 5-gram. Example:

Take the address "15 rue andre lalande residence marguerite yourcenar 91000 evry france". The word "residence" matches the bag ["resid","eside","dence",...], whose id is 5, so I put 1 in the column named 5. The corresponding row of the "idv_x_bags" matrix will therefore look like this:

> idv_x_bags[150000,]
  4   5   6   8  10  12  13  15  17  18  22  26  29  34  35  36  42  43  45  46  47  48  52  55  81  82 108 114 119 122 123 
  0   1   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0 

Here is the code that does this:

library(sqldf)

# one row per address, one column per bag id
idv_x_bags <- matrix(0, nrow = nrow(data), ncol = 31)
colnames(idv_x_bags) <- as.vector(sqldf("select distinct idSac from list_ngrams order by idSac"))$idSac

    for(i in 1:nrow(idv_x_bags))
    {
      for(ngram in list_ngrams$ngram)
      {
        if(grepl(ngram, data[i,]))
        {
          # look up the id of the bag that contains this 5-gram
          idSac <- sqldf(sprintf("select idSac from list_ngrams where ngram='%s'",ngram))[[1]]
          idv_x_bags[i,as.character(idSac)] <- 1
        }
      }
    }

The code does exactly what I want, but it takes about 18 hours, which is far too long. I tried to rewrite it in C++ using the Rcpp library but ran into many problems. I also tried to rewrite it using apply, but couldn't get it to work. Here is what I did:

apply(cbind(data,1:nrow(data)),1,function(x){
  apply(list_ngrams,1,function(y){
   if(grepl(y[2],x[1])==TRUE){idv_x_bags[x[2],str_trim(as.character(y[1]))]<-1}
  })
})

I need some help rewriting my loop using apply or some other method that runs faster than the current one. Thank you very much.
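For illustration, here is a minimal sketch of one way the double loop could be restructured. It is not a tested solution for the full dataset. `grepl` is vectorized over its second argument, so each of the 131 n-grams can be checked against all addresses in one call, and the bag id can be read straight from `list_ngrams` without going through `sqldf` on every match. The toy `data` and `list_ngrams` below are made-up stand-ins, and `fixed = TRUE` assumes the n-grams should be matched as literal strings rather than regular expressions. Note also that an assignment made inside a function passed to `apply` only changes a local copy of `idv_x_bags`, which is one reason the nested `apply` attempt leaves the matrix untouched.

```r
# Toy stand-ins for the real objects (hypothetical data, for illustration only)
data <- data.frame(
  address = c("15 rue andre lalande residence marguerite yourcenar 91000 evry france",
              "10 downing street london",
              "5 avenue foch paris"),
  stringsAsFactors = FALSE)
list_ngrams <- data.frame(
  idSac = c(4, 4, 5, 5, 30, 30),
  ngram = c("stree", "treet", "avenu", "venue", "resid", "eside"),
  stringsAsFactors = FALSE)

# One column per distinct bag id, named after the id
bag_ids <- sort(unique(list_ngrams$idSac))
idv_x_bags <- matrix(0, nrow = nrow(data), ncol = length(bag_ids),
                     dimnames = list(NULL, as.character(bag_ids)))

# Loop over the n-grams only; each grepl call scans every address at once
for (k in seq_len(nrow(list_ngrams))) {
  # assumption: literal substring matching (fixed = TRUE), not regex matching
  hit <- grepl(list_ngrams$ngram[k], data$address, fixed = TRUE)
  idv_x_bags[hit, as.character(list_ngrams$idSac[k])] <- 1
}

idv_x_bags
```

With this layout the number of `grepl` calls drops from (addresses x n-grams) to the number of n-grams, and no SQL query runs inside the loop.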

  • 5
    Please see [these instructions on how to give a reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) – Jaap Aug 11 '15 at 09:44

1 Answer

Check this one and run the simple example step by step to see how it works. My N-Grams don't make much sense, but it will work with actual N_Grams as well.

 library(dplyr)
 library(reshape2)

 # your example dataset
 dt_sen = data.frame(sen = c("this is a good thing", "this is bad"), stringsAsFactors = FALSE)
 dt_ngr = data.frame(id_ngr = c(2,2,2,3,3,3),
                     ngr = c("th","go","tt","drf","ytu","bad"), stringsAsFactors = FALSE)

 # sentence dataset
 dt_sen

                   sen
1 this is a good thing
2          this is bad


 #ngrams dataset
 dt_ngr

  id_ngr ngr
1      2  th
2      2  go
3      2  tt
4      3 drf
5      3 ytu
6      3 bad



 # create table of matches
 expand.grid(unique(dt_sen$sen), unique(dt_ngr$id_ngr)) %>%
   data.frame() %>%
   rename(sen = Var1,
          id_ngr = Var2) %>%
   left_join(dt_ngr, by = "id_ngr") %>%
   group_by(sen, id_ngr,ngr) %>%
   do(data.frame(match = grepl(.$ngr,.$sen))) %>%
   group_by(sen,id_ngr) %>%
   summarise(sum_success = sum(match)) %>%
   mutate(match = ifelse(sum_success > 0,1,0)) -> dt_full

 dt_full
Source: local data frame [4 x 4]
Groups: sen

                   sen id_ngr sum_success match
1 this is a good thing      2           2     1
2 this is a good thing      3           0     0
3          this is bad      2           1     1
4          this is bad      3           1     1


 # reshape table
 dt_full %>% dcast(., sen~id_ngr, value.var = "match")
                   sen 2 3
1 this is a good thing 1 0
2          this is bad 1 1
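As a possible alternative to the pipeline above, the same sentence-by-bag table can be sketched in base R only. This is an illustrative variant, not the answer's original method: it builds a logical sentences x n-grams matrix with `sapply` and `grepl`, then collapses the n-gram columns by bag id with `rowsum`. The names `hits` and `bag_match` are made up for this sketch, and `fixed = TRUE` assumes literal matching.

```r
# Same toy data as in the answer
dt_sen <- data.frame(sen = c("this is a good thing", "this is bad"),
                     stringsAsFactors = FALSE)
dt_ngr <- data.frame(id_ngr = c(2, 2, 2, 3, 3, 3),
                     ngr = c("th", "go", "tt", "drf", "ytu", "bad"),
                     stringsAsFactors = FALSE)

# Logical matrix: one row per sentence, one column per n-gram
# (assumption: n-grams are matched literally, hence fixed = TRUE)
hits <- sapply(dt_ngr$ngr, grepl, x = dt_sen$sen, fixed = TRUE)

# Sum the n-gram hits within each bag, then flag bags with at least one hit
bag_counts <- t(rowsum(t(hits) * 1, group = dt_ngr$id_ngr))
bag_match <- (bag_counts > 0) * 1

bag_match
```

The columns of `bag_match` are named after the bag ids (`2` and `3`) and the rows follow the order of `dt_sen$sen`, so the values can be compared directly with the `dcast` result.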
AntoniosK
  • I tried your method but it didn't work. The problem is that I don't use words but character N-Grams. I rewrote my initial code using apply, but it doesn't work and I need help with it: `apply(cbind(data[1:10,1],1:10),1,function(x){ apply(list_ngrams,1,function(y){ if(grepl(y[2],x[1])==TRUE){idv_x_bags[x[2],str_trim(as.character(y[1]))]<-1} }) }) ` – Taoufiq Mouhcine Aug 11 '15 at 13:29
  • What if you save your N-Grams in vector form? Also, is the id=5 in your example the 5th position, where the word "residence" appears in your sentence, or the id of the N-Gram for which you spot a match? My method does the first one, but I can update it. – AntoniosK Aug 11 '15 at 13:56
  • The update would create all possible combinations of sentences and N-Grams instead of the combinations of sentences and positions (that it does now). So, for each sentence you'll get all the N-Grams that match. – AntoniosK Aug 11 '15 at 14:08
  • No, it's not. I'm sorry my example is not clear. 5 is the id of the bag that contains all the 5-grams that exist in the word residence. Here is a sample of the 5-gram bags: ` idBag bag 1 30 resid;eside;siden;idenc;dence;ence_;_resid 2 2 franc;rance;ance_;_fran` If the address contains a given 5-gram, I want to get the id of the bag that contains that 5-gram and put 1 in the column corresponding to that id. – Taoufiq Mouhcine Aug 11 '15 at 14:18
  • I think I'll be able to update my script and include that. Not sure if it will be quicker though. You'll have to check that as I'll be using a simpler version of your problem. – AntoniosK Aug 11 '15 at 14:22
  • You're welcome. Just double check it does exactly what you want. So, when you say "..."residence" exists in the bag ["resid","eside","dence",...]" does that mean that all elements of that N-Gram match with "residence", or at least one? Because the element "dence" matches with the word "evidence". What happens there? If you had an address "132 Evidence Valley" should it get the id from N-Gram ["resid","eside","dence",...]? – AntoniosK Aug 12 '15 at 09:27
  • Yes that's it. And it's not a problem because I'm gonna carry out a PCA on the result. – Taoufiq Mouhcine Aug 12 '15 at 12:00
  • Great. Happy that I've helped a bit. Good luck with the rest of the process! – AntoniosK Aug 12 '15 at 12:38
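The bag sample quoted in the comments stores each bag as one semicolon-separated string, while the question's `list_ngrams` has one row per 5-gram. A minimal sketch of how one layout could be expanded into the other is shown below. The data frame name `bags` is hypothetical, and its two rows are copied from the sample in the comment.

```r
# Bags in the "one string per bag" layout (rows taken from the comment's sample)
bags <- data.frame(
  idBag = c(30, 2),
  bag = c("resid;eside;siden;idenc;dence;ence_;_resid",
          "franc;rance;ance_;_fran"),
  stringsAsFactors = FALSE)

# Split each bag on ";" and repeat the bag id once per 5-gram
parts <- strsplit(bags$bag, ";", fixed = TRUE)
list_ngrams <- data.frame(
  idSac = rep(bags$idBag, lengths(parts)),
  ngram = unlist(parts),
  stringsAsFactors = FALSE)

list_ngrams
```

The result has the same two columns (`idSac`, `ngram`) as the `list_ngrams` table shown in the question, so it can be fed to any of the matching approaches discussed in this thread.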