
I have two datasets :

A 10*1 matrix containing names of countries :

countries<-structure(
  c("usa", "canada", "france", "england", "brazil",
    "spain", "germany", "italy", "belgium", "switzerland"),
  .Dim = c(10L,1L))

And a 20*2 matrix containing 3-grams and ids of those 3-grams :

tri_grams<-    structure(
  c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10", 
    "11", "12", "13", "14", "15", "16", "17", "18", "19", "20",
    "mo", "an", "ce", "ko", "we", "ge", "ma", "fi", "br", "ca",
    "gi", "po", "ro", "ch", "ru", "tz", "il", "sp", "ai", "jo"), 
  .Dim = c(20L,2L),
  .Dimnames = list(NULL, c("id", "triGram")))

I want to loop over the countries and, for each row, get the tri_grams that occur in the country name. For example, "brazil" contains "br" and "il". For each match I want the pair (index of the country (double), id of the tri-gram (char)). So for brazil I want to get (5,"9") and (5,"17").

Here is the code with a simple loop :

res <- matrix(ncol=2,nrow=nrow(countries)*nrow(tri_grams))
colnames(res) <- c("indexCountry","idTriGram")
k <- 0

for(i in 1:nrow(countries))
{
  for(j in 1:nrow(tri_grams))
  {
    if(grepl(tri_grams[j,2],countries[i,1])==TRUE)
    {
      k <- k+1
      res[k,1] <- i
      res[k,2] <- tri_grams[j,1]
    }
  }
}
res <- res[1:k,]

It works perfectly, and here are the results:

     indexCountry idTriGram
 [1,] "2"          "2"      
 [2,] "2"          "10"     
 [3,] "3"          "2"      
 [4,] "3"          "3"      
 [5,] "4"          "2"      
 [6,] "5"          "9"      
 [7,] "5"          "17"     
 [8,] "6"          "18"     
 [9,] "6"          "19"     
[10,] "7"          "2"      
[11,] "7"          "6"      
[12,] "7"          "7"      
[13,] "9"          "11"     
[14,] "10"         "2"      
[15,] "10"         "16"   

I want to get the same result but using apply. This is just a sample: my real dataset is huge, and the simple loop takes a very long time to run on it (more than 10 hours). I tried to write it with apply but didn't succeed.

  • is the country dataset also large in your real data? It might be worth precalculating the possible ngrams in their names – jeremycg Aug 11 '15 at 15:45
  • Instead of the actual data, could you post the results from `dput(countries)` and `dput(tri_grams)` as the data. It would make it easier for us to get your data into R – Rich Scriven Aug 11 '15 at 15:48
  • Thank you for your answer. Yes It's large too because it doesn't contain only countries but cities, areas, ... I have to use the data that I have because I'm not allowed to generate the n-grams by myself. – Taoufiq Mouhcine Aug 11 '15 at 15:51
  • @RichardScriven I updated the post and added dputs – Taoufiq Mouhcine Aug 11 '15 at 15:54
  • possible duplicate of [Optimization of an R loop taking 18 hours to run](http://stackoverflow.com/questions/31938118/optimization-of-an-r-loop-taking-18-hours-to-run) – A. Webb Aug 11 '15 at 16:35
  • Just realized this is a duplicate. Unless you are asking something here that you aren't there, just edit the previous question to clarify. – A. Webb Aug 11 '15 at 16:36
  • Apply functions are not magical, if you look at the inner workings of the apply family of functions they are just well-constructed for loops. Your optimization likely needs to be done using a different approach than just shoe-horning your for loop into an apply function. – Forrest R. Stevens Aug 11 '15 at 17:58

1 Answer


I don't know how much faster this really is, but here is at least a succinct way to get the same results.

x<-which(outer(tri_grams[,"triGram"],countries,Vectorize(grepl))[,,1],arr.ind=TRUE)
cbind(country=x[,2],trigram=x[,1])
     country trigram
 [1,]       2       2
 [2,]       2      10
 [3,]       3       2
 [4,]       3       3
 [5,]       4       2
 [6,]       5       9
 [7,]       5      17
 [8,]       6      18
 [9,]       6      19
[10,]       7       2
[11,]       7       6
[12,]       7       7
[13,]       9      11
[14,]      10       2
[15,]      10      16
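
A possible variation on the same idea, offered only as a sketch: since `grepl` is already vectorized over the strings it searches, one call per tri-gram can test every country at once, which avoids the element-by-element calls that `Vectorize(grepl)` makes. The names `ids`, `grams`, `places`, `hits` and `res2` are made up for this illustration, and `fixed = TRUE` assumes the tri-grams are meant as literal substrings rather than regular expressions. Whether this is fast enough on the real data has not been measured here.

```r
# Sample data from the question
countries <- structure(
  c("usa", "canada", "france", "england", "brazil",
    "spain", "germany", "italy", "belgium", "switzerland"),
  .Dim = c(10L, 1L))
tri_grams <- structure(
  c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10",
    "11", "12", "13", "14", "15", "16", "17", "18", "19", "20",
    "mo", "an", "ce", "ko", "we", "ge", "ma", "fi", "br", "ca",
    "gi", "po", "ro", "ch", "ru", "tz", "il", "sp", "ai", "jo"),
  .Dim = c(20L, 2L),
  .Dimnames = list(NULL, c("id", "triGram")))

# Pull the columns out as plain character vectors (hypothetical helper names)
ids    <- tri_grams[, "id"]
grams  <- tri_grams[, "triGram"]
places <- countries[, 1]

# One vectorized grepl() call per tri-gram: each call scans all countries.
# fixed = TRUE matches the tri-gram literally (assumption: no regex intended).
hits <- lapply(seq_along(grams), function(j) {
  which(grepl(grams[j], places, fixed = TRUE))
})

# Flatten: country indices, with each id repeated once per country it matched
res2 <- data.frame(
  indexCountry = as.integer(unlist(hits)),
  idTriGram    = rep(ids, lengths(hits)),
  stringsAsFactors = FALSE)

# Sort by country; ties keep tri-gram row order, as in the original loop
res2 <- res2[order(res2$indexCountry), ]
rownames(res2) <- NULL
res2
```

Unlike the matrix built by the loop in the question, `res2` is a data frame, so `indexCountry` stays numeric while `idTriGram` stays character. The number of `grepl` calls equals the number of tri-grams instead of countries times tri-grams.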
A. Webb
  • Thank you for your method. I updated it and adapted it to my problem and It worked very well. It's not as fast as apply but It's much faster than my old method. Thank you very much. – Taoufiq Mouhcine Aug 12 '15 at 08:37