14

What algorithm is used for finding n-grams?

Supposing my input data is an array of words and the size of the n-grams I want to find, what algorithm should I use?

I'm asking for code, with a preference for R. The data is stored in a database, so it can be a PL/pgSQL function too. Java is the language I know best, so I can "translate" it into another language if needed.

I'm not lazy; I'm only asking for code because I don't want to reinvent the wheel implementing an algorithm that has already been done.

Edit: it's important to know how many times each n-gram appears.

Edit 2: is there an R package for n-grams?
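Since the question mentions Java as the best-known language, here is a minimal sketch of the standard sliding-window approach for word-level n-grams with counts (this is an illustration I'm adding, not code from any answer below; the class and method names are my own):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class NgramCounter {
    // Slide a window of n words across the array and count each n-gram.
    public static Map<String, Integer> count(String[] words, int n) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (int i = 0; i + n <= words.length; i++) {
            StringBuilder sb = new StringBuilder();
            for (int j = 0; j < n; j++) {
                if (j > 0) sb.append(' ');
                sb.append(words[i + j]);
            }
            counts.merge(sb.toString(), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        String[] words = {"to", "be", "or", "not", "to", "be"};
        // bigrams: "to be" occurs twice, the others once
        System.out.println(count(words, 2));
    }
}
```

The same idea translates directly to R or PL/pgSQL: one pass over the array, a window of length n, and a map/table of counts.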

Ben
Renato Dinhani

7 Answers

24

If you want to use R to identify n-grams, you can use the tm package together with the RWeka package. It will tell you how many times each n-gram occurs in your documents, like so:

  library("RWeka")
  library("tm")

  data("crude")

  BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
  tdm <- TermDocumentMatrix(crude, control = list(tokenize = BigramTokenizer))

  inspect(tdm[340:345,1:10])

A term-document matrix (6 terms, 10 documents)

Non-/sparse entries: 4/56
Sparsity           : 93%
Maximal term length: 13 
Weighting          : term frequency (tf)

               Docs
Terms           127 144 191 194 211 236 237 242 246 248
  and said        0   0   0   0   0   0   0   0   0   0
  and security    0   0   0   0   0   0   0   0   1   0
  and set         0   1   0   0   0   0   0   0   0   0
  and six-month   0   0   0   0   0   0   0   1   0   0
  and some        0   0   0   0   0   0   0   0   0   0
  and stabilise   0   0   0   0   0   0   0   0   0   1

hat-tip: http://tm.r-forge.r-project.org/faq.html

Ben
8

For anyone still interested in this topic, there is now a package on CRAN:

ngram: An n-gram Babbler

This package offers utilities for creating, displaying, and "babbling" n-grams. The babbler is a simple Markov process.

http://cran.r-project.org/web/packages/ngram/index.html

IceBruce
3

Usually n-grams are computed in order to find their frequency distribution, so yes, it does matter how many times each n-gram appears.

You also need to decide whether you want character-level or word-level n-grams. I have written code for finding character-level n-grams from a CSV file in R, using the 'tau' package. You can find it here.

Also here is the code I wrote:

library(tau)

# read the text from a CSV file (no header)
temp <- read.csv("/home/aravi/Documents/sample/csv/ex.csv", header = FALSE, stringsAsFactors = FALSE)

# count all character n-grams up to length 4
r <- textcnt(temp, method = "ngram", n = 4L, split = "[[:space:][:punct:]]+", decreasing = TRUE)

# put counts and n-gram sizes into a data frame, then split by size
a <- data.frame(counts = unclass(r), size = nchar(names(r)))
b <- split(a, a$size)
b

Cheers!

Aravind Asok
1

EDIT: Sorry, this is PHP. I wasn't quite sure what you wanted. I don't know how to do it in Java, but perhaps the following could be converted easily enough.

Well it depends on the size of the ngrams you want.

I've had quite a lot of success with single letters (especially accurate for language detection), which are easy to get with:

$letters=str_split(preg_replace('/[^a-z]/', '', strtolower($text)));
$letters=array_count_values($letters);

Then there is the following function for calculating ngrams from a word:

function getNgrams($word, $n = 3) {
    $ngrams = array();
    $len = strlen($word);
    for ($i = 0; $i < $len; $i++) {
        // only start emitting once a full n-gram fits ending at position $i
        if ($i > ($n - 2)) {
            $ng = '';
            for ($j = $n - 1; $j >= 0; $j--) {
                $ng .= $word[$i - $j];
            }
            $ngrams[] = $ng;
        }
    }
    return $ngrams;
}

The source of the above is here; I recommend reading it, as it has lots of functions to do exactly what you want.

Alasdair
0

You can use the ngram package. One example of its usage is http://amunategui.github.io/speak-like-a-doctor/

Niru
  • Hello, welcome to SO. This answer relies almost completely on external links. Should they ever become invalid, your answer would become useless. So please edit it and add at least a summary of what can be found there. Thank you! – Fabio says Reinstate Monica Mar 31 '16 at 16:35
0

Have a look at https://cran.r-project.org/web/packages/ngram/vignettes/ngram-guide.pdf

Here is a quick example. It's quite fast; see the benchmarks in the vignette.

require(ngram)

# equivalent to: "hi i am ig" %>% ngram(n = 2) %>% get.ngrams() (with magrittr loaded)
get.ngrams(ngram("hi i am ig", n = 2))
Indranil Gayen
0

Simple, here's the Java answer:

int ngrams = 9; // let's say up to 9-grams, since "bonasuera" has length 9
String string = "bonasuera";
for (int j = 1; j <= ngrams; j++) {
    for (int k = 0; k < string.length() - j + 1; k++) {
        System.out.print(string.substring(k, k + j) + " ");
    }
    System.out.println();
}

output :

b o n a s u e r a 
bo on na as su ue er ra 
bon ona nas asu sue uer era 
bona onas nasu asue suer uera 
bonas onasu nasue asuer suera 
bonasu onasue nasuer asuera 
bonasue onasuer nasuera 
bonasuer onasuera 
bonasuera
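The loop above enumerates every character n-gram; since the question's edit asks for frequencies, the same loop can feed a map instead of printing. A small sketch (my own extension of the answer above, not part of it; the class name is hypothetical):

```java
import java.util.HashMap;
import java.util.Map;

public class CharNgramCounts {
    // Count all character n-grams of one fixed length in a string.
    public static Map<String, Integer> count(String s, int n) {
        Map<String, Integer> counts = new HashMap<>();
        for (int k = 0; k + n <= s.length(); k++) {
            counts.merge(s.substring(k, k + n), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // e.g. in "banana" the bigrams "an" and "na" each occur twice
        System.out.println(count("banana", 2));
    }
}
```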