Implementing n-grams for next word prediction

Question

I'm trying to utilize a trigram for next word prediction.

I have been able to load a corpus and identify the most common trigrams by their frequencies. I used the "ngrams", "RWeka" and "tm" packages in R. I followed this question for guidance:

What algorithm I need to find n-grams?

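library(tm)     # Corpus, VectorSource, TermDocumentMatrix
library(RWeka)  # NGramTokenizer, Weka_control

text1 <- readLines("MyText.txt", encoding = "UTF-8")
corpus <- Corpus(VectorSource(text1))

# min = max = 3, so this tokenizer produces trigrams
TrigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
tdm <- TermDocumentMatrix(corpus, control = list(tokenize = TrigramTokenizer))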

If a user were to input a set of words, how would I go about generating the next word? For example, if a user types "can of", how would I retrieve the three most likely next words (e.g. beer, soda, paint)?

Please provide a reproducible example, which shows example data, what you tried and why it failed. — lukeA, Jul 09 '15 at 11:51

score 4 · Accepted Answer · answered Jul 09 '15 at 11:48

Here's one way as a starter:

library(tau)  # textcnt()

f <- function(queryHistoryTab, query, n = 2) {
  query <- tolower(query)
  # Count n-grams one word longer than the query (trigrams for a two-word query)
  ngramLen <- length(scan(text = query, what = "character", quiet = TRUE)) + 1
  trigrams <- sort(textcnt(rep(tolower(names(queryHistoryTab)), queryHistoryTab), method = "string", n = ngramLen))
  # Keep the n-grams that start with the query
  idx <- which(substr(names(trigrams), 1, nchar(query)) == query)
  # Take the n most frequent matches, then strip the query and the space after it
  res <- head(names(sort(trigrams[idx], decreasing = TRUE)), n)
  res <- substr(res, nchar(query) + 2, nchar(res))
  return(res)
}
f(c("Can of beer" = 3, "can of Soda" = 2, "A can of water" = 1, "Buy me a can of soda, please" = 2), "Can of")
# [1] "soda" "beer"

lukeA

Thanks. I see how you created the function f, but would this be feasible if I have thousands of trigrams that I want to use as a training set? – statsguyz Jul 09 '15 at 11:54
I thought of pre-building a database of common terms using the function. Not everything must rely on machine learning. In addition: please edit your question and provide the example code you worked with. – lukeA Jul 09 '15 at 11:56
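
One possible reading of the "pre-built database" idea, as a minimal base R sketch rather than lukeA's actual method: compute the top next words for every two-word prefix once, then answer queries with a plain list lookup. The names `trigram_counts` and `build_lookup` and the toy counts are hypothetical.

```r
# Toy trigram counts (hypothetical; in practice these would come from the corpus)
trigram_counts <- c("can of beer" = 3, "can of soda" = 4, "can of water" = 1,
                    "a can of" = 3, "me a can" = 2)

build_lookup <- function(counts, n = 3) {
  terms <- names(counts)
  # Split each trigram into its two-word prefix and its final word
  prefix <- sub(" [^ ]+$", "", terms)
  nextwd <- sub("^.* ", "", terms)
  d <- data.frame(nextwd = nextwd, freq = as.numeric(counts),
                  stringsAsFactors = FALSE)
  # For every prefix, keep the n most frequent final words
  lapply(split(d, prefix), function(g) head(g$nextwd[order(-g$freq)], n))
}

lookup <- build_lookup(trigram_counts)
lookup[["can of"]]
```

Because the table is built once, each query is a single list access, which should stay cheap with thousands of trigrams. A prefix that never occurred returns `NULL`.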
@lukeA how would you feed a term document matrix (tdm) into the function f? – Economist_Ayahuasca Oct 09 '16 at 17:51
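
A hedged sketch of one way to get from a term-document matrix to something a lookup can use: collapse the matrix to one total count per trigram. The matrix `m` below is a hand-built stand-in for what `as.matrix(tdm)` would return (terms in rows, documents in columns), and `next_words` is a hypothetical simplified stand-in, not the function `f` from the answer.

```r
# Stand-in for as.matrix(tdm): rows are trigram terms, columns are documents.
# (Hypothetical toy counts; a real tdm from the tm package would supply these.)
m <- matrix(c(2, 1,
              1, 3,
              1, 0,
              0, 3),
            ncol = 2, byrow = TRUE,
            dimnames = list(Terms = c("can of beer", "can of soda",
                                      "can of water", "a can of"),
                            Docs = c("1", "2")))

# Total frequency of each trigram across all documents (a named numeric vector)
freq <- rowSums(m)

next_words <- function(freq, query, n = 3) {
  # The trailing space forces a whole-word match on the prefix
  prefix <- paste0(tolower(query), " ")
  hits <- sort(freq[startsWith(names(freq), prefix)], decreasing = TRUE)
  head(substring(names(hits), nchar(prefix) + 1), n)
}

next_words(freq, "Can of")
```

Since `f` takes a named vector of counts as its first argument, a vector like `freq` should also be usable there, though that is untested here. For a large corpus, converting the whole matrix with `as.matrix` may not fit in memory, so a sparse row sum would be preferable.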

score 0 · Answer 2 · edited Feb 26 '17 at 18:07

I just tried this. Hopefully the following code and its comments will help, though I would also like to see how an RNN might work on trigrams. Naive Bayes did not do a decent job, possibly owing to the sparsity of the trigrams. Gram_12 is the bigram formed by the first two words of each trigram. Consider this a first step, not the final model for your effort.

library(stringr)  # word()
library(qdap)     # word_count()

# Assumes qry (the input string), tri.df (columns Gram_12, Gram_3) and
# bi.df (columns Gram_1, Gram_2) already exist.
if (word_count(qry) >= 2) {
    lastwd <- word(qry, -2:-1)
    test <- paste(lastwd[1], lastwd[2])
    # Check whether the last two words match the first two words of a trigram.
    # Exact comparison: grepl() would also match substrings such as "can often".
    index1 <- tri.df$Gram_12 == test
    if (any(index1)) {
        # Subset the trigrams and group by Gram_3
        filtered <- tri.df[index1, ]
        # Find the frequency of each unique group
        freq <- data.frame(table(filtered$Gram_3))
        # Order by frequency of Gram_3 and keep the top 5
        freq <- head(freq[order(-freq$Freq), ], 5)
        predict <- as.character(freq[freq$Freq > 0, ]$Var1)
        #return(predict)
    } else {
        # Not found: fall back to the last word only
        lastwd <- word(qry, -1)
        # Search the bigrams by Gram_1 and group by Gram_2
        index2 <- bi.df$Gram_1 == lastwd
        if (any(index2)) {
            filtered <- bi.df[index2, ]
            # Find the frequency of each unique group
            freq <- data.frame(table(filtered$Gram_2))
            # Order by frequency of Gram_2 and keep the top 5
            freq <- head(freq[order(-freq$Freq), ], 5)
            predict <- as.character(freq[freq$Freq > 0, ]$Var1)
        } else {
            predict <- "Need more training to predict"
        }
    }
} else {
    # Only one word was supplied: search the bigrams directly.
    # "} else {" stays on one line so the script also parses at top level.
    lastwd <- word(qry, -1)
    index3 <- bi.df$Gram_1 == lastwd
    if (any(index3)) {
        filtered <- bi.df[index3, ]
        # Find the frequency of each unique group
        freq <- data.frame(table(filtered$Gram_2))
        # Order by frequency of Gram_2 and keep the top 5
        freq <- head(freq[order(-freq$Freq), ], 5)
        predict <- as.character(freq[freq$Freq > 0, ]$Var1)
    } else {
        predict <- "Need more training to predict"
    }
}
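
The trigram-then-bigram fallback above can also be wrapped in a function. The following is a minimal base R sketch under stated assumptions, not the answerer's final model: the toy `tri.df` and `bi.df` tables only mimic the column layout the answer assumes (one row per n-gram occurrence), and `top_next` and `predict_next` are hypothetical names.

```r
# Hypothetical toy tables with the column layout the answer assumes
tri.df <- data.frame(Gram_12 = c("can of", "can of", "can of", "a can"),
                     Gram_3  = c("soda", "soda", "beer", "of"),
                     stringsAsFactors = FALSE)
bi.df <- data.frame(Gram_1 = c("of", "of", "of", "can"),
                    Gram_2 = c("the", "the", "soda", "of"),
                    stringsAsFactors = FALSE)

# Most frequent values of nxt among the rows whose key matches exactly
top_next <- function(keys, nxt, key, n) {
  hits <- nxt[keys == key]
  if (length(hits) == 0) return(character(0))
  head(names(sort(table(hits), decreasing = TRUE)), n)
}

predict_next <- function(qry, n = 5) {
  words <- strsplit(tolower(trimws(qry)), "\\s+")[[1]]
  res <- character(0)
  # First try the last two words against the trigram table
  if (length(words) >= 2) {
    key <- paste(tail(words, 2), collapse = " ")
    res <- top_next(tri.df$Gram_12, tri.df$Gram_3, key, n)
  }
  # Fall back to the last word against the bigram table
  if (length(res) == 0 && length(words) >= 1) {
    res <- top_next(bi.df$Gram_1, bi.df$Gram_2, tail(words, 1), n)
  }
  res
}

predict_next("can of")    # found in the trigram table
predict_next("glass of")  # no trigram match, falls back to bigrams
```

Returning an empty character vector when nothing matches, instead of a message string, lets the caller decide how to report a miss.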
