I am trying to analyse the n-grams of a corpus stored in a data.table. I want to compute all the 1-grams (or 2-, 3-, 4-grams) and store each one, its count, and the rows in which it appears in a data.table. I have managed to do this using sapply:
# Load each package separately, before it is used; library() attaches one package per call
library(data.table)
library(stringi)
library(tau)
smallCorpus<-data.table(id = 1:3,
corpus = c("<s> exactly how long do you want a you tube videos to be anyway </s>","<s> google scrapped the early version of its smart glasses in january </s>","<s> exactly how long do you want a you tube videos to be anyway </s> <s> today we are announcing the success of our integration test </s>"),
key="id")
genNgramTable<-function(cC,n){
Count<- textcnt(cC[,corpus],n=n,split=" ",method="string",decreasing=TRUE)
Ngram<-data.table(gram=names(Count),count=Count,key="gram")
listOfOcc<-sapply(Ngram[,gram],
function(gram,corpus){which(stri_detect_fixed(corpus," "%s+%gram%s+%" "))},
cC[,corpus])
Ngram<-Ngram[,Fkey:=listOfOcc]
}
gram1<-genNgramTable(smallCorpus,1L)
My question is: is it possible to do this with a single data.table call? (My hope is that it would be faster.) I have tried:
genNgramTable<-function(cC,n){
Count<- textcnt(cC[,corpus],n=n,split=" ",method="string",decreasing=TRUE)
Ngram<-data.table(gram=names(Count),count=Count,key="gram")
Ngram<-Ngram[,Fkey:=which(stri_detect_fixed(cC[,corpus]," "%s+%gram%s+%" "))]
}
This gives the following warning:
Warning message:
In `[.data.table`(Ngram, , `:=`(Fkey, which(stri_detect_fixed(cC[, :
Supplied 17 items to be assigned to 33 items of column 'Fkey' (recycled leaving remainder of 16 items).
It also puts only a single number in each row of the Fkey column. Moreover, these numbers fall outside the range of my row numbers (1:3).
I would be grateful if someone could explain why.
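A smaller sketch that may help isolate the behaviour. It uses hypothetical toy data (`docs` and `grams` are made up for illustration, not taken from my corpus) and skips tau entirely. The assumption being tested is that, without `by`, `gram` inside `[.data.table` is the whole column, so `stri_detect_fixed` pairs pattern i with document i (recycling the shorter vector) instead of testing every document against each gram. The grouped call at the end is one possible way to get a per-gram list of rows; I have not checked whether it is actually faster than sapply.

```r
library(data.table)
library(stringi)

# Hypothetical toy data: two padded "documents" and four candidate grams
docs  <- c(" a b c ", " b c d ")
grams <- data.table(gram = c("a", "b", "c", "d"), key = "gram")

# Ungrouped: four patterns against two documents. stringi recycles `docs`,
# so the result has one element per gram, not one per document.
paired <- stri_detect_fixed(docs, " " %s+% grams$gram %s+% " ")
length(paired)  # 4, the number of grams
which(paired)   # 1 2 3 4, i.e. positions among the grams, not document rows

# Grouped: with by = gram, `gram` is a single string inside j, so every
# document is tested against it. list(list(...)) stores the integer vector
# of matching rows as one list-column cell per gram.
grams[, Fkey := list(list(which(stri_detect_fixed(docs, " " %s+% gram %s+% " ")))),
      by = gram]
grams$Fkey  # list of: 1; 1 2; 1 2; 2
```

If the same mechanism applies to my attempt, it would account for both the recycling warning (the length of `which(...)` has nothing to do with the number of grams) and the values larger than 3.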