Questions tagged [n-gram]

An N-gram is an ordered collection of N elements of the same kind, usually drawn from a large collection of many other similar N-grams. The individual elements are commonly natural language words, though N-grams have been applied to many other data types, such as numbers, letters, nucleotide sequences in DNA, etc. Statistical N-gram analysis is commonly performed as part of natural language processing, bioinformatics, and information theory.

N-grams may be derived for any positive integer N. 1-grams are called "unigrams," 2-grams are called "bigrams," 3-grams are called "trigrams," and higher-order N-grams are simply called by number, e.g. "4-grams." N-gram techniques may be applied to any kind of ordered data. Metadata such as end-of-sentence markers may or may not be included.

For example, using words as the elements and an N of 2, the English sentence "Three cows eat grass." could be broken into the 2-grams [{Three cows}, {cows eat}, {eat grass}, {grass #}], where # is a metadata marker denoting the end of the sentence.
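The example above can be sketched in Python; the helper name `word_ngrams` and the hand-appended `#` marker are illustrative, not a standard API:

```python
# Word-level 2-grams for the sentence above, with a hand-appended
# end-of-sentence marker "#" as in the example.
def word_ngrams(tokens, n):
    """Return all n-grams (as tuples) over an ordered token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "Three cows eat grass".split() + ["#"]
bigrams = word_ngrams(tokens, 2)
# [('Three', 'cows'), ('cows', 'eat'), ('eat', 'grass'), ('grass', '#')]
```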

As N-gram analysis embeds the data set into a vector space, it allows the application of many powerful statistical techniques to data for prediction, classification, and discernment of various properties.

More information:

  1. Google's Ngram Viewer
  2. Wikipedia article
874 questions
174 votes, 17 answers

n-grams in python, four, five, six grams?

I'm looking for a way to split a text into n-grams. Normally I would do something like:

    import nltk
    from nltk import bigrams
    string = "I really like python, it's pretty awesome."
    string_bigrams = bigrams(string)
    print string_bigrams

I am aware that…
Shifu
  • 2,115
  • 3
  • 17
  • 15
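For the question above, `nltk.util.ngrams(tokens, n)` accepts any n, so four-, five-, and six-grams fall out of the same call. A pure-standard-library sketch of the same idea (the helper name `ngrams` is ours):

```python
def ngrams(tokens, n):
    """Slide a window of width n over an ordered token list -- the same
    idea as nltk.util.ngrams, with no third-party dependency."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

words = "I really like python , it 's pretty awesome .".split()
fourgrams = ngrams(words, 4)   # works identically for n = 5, 6, ...
```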
68 votes, 3 answers

Elasticsearch: Find substring match

I want to perform both exact word match and partial word/substring match. For example, if I search for "men's shaver" then I should be able to find "men's shaver" in the result. But in case I search for "en's shaver" then also I should be able…
55 votes, 5 answers

Simple implementation of N-Gram, tf-idf and Cosine similarity in Python

I need to compare documents stored in a DB and come up with a similarity score between 0 and 1. The method I need to use has to be very simple. Implementing a vanilla version of n-grams (where it is possible to define how many grams to use), along…
seanieb
  • 1,196
  • 2
  • 14
  • 36
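A minimal, standard-library-only sketch of the pipeline the question describes (character n-grams → tf-idf → cosine similarity in [0, 1]); all names here are ours, and a real project would more likely reach for scikit-learn's TfidfVectorizer:

```python
import math
from collections import Counter

def char_ngrams(text, n=3):
    """Overlapping character n-grams of a string."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def tfidf_vectors(docs, n=3):
    """Map each document to a sparse {ngram: tf-idf weight} dict."""
    tfs = [Counter(char_ngrams(d, n)) for d in docs]
    df = Counter(g for tf in tfs for g in tf)        # document frequency
    N = len(docs)
    return [{g: tf[g] * math.log(N / df[g]) for g in tf} for tf in tfs]

def cosine(u, v):
    """Cosine similarity of two sparse non-negative vectors."""
    dot = sum(w * v.get(g, 0.0) for g, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0
```

With this, similar documents share weighted n-grams and score near 1, while unrelated documents score near 0.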
54 votes, 6 answers

Python: Reducing memory usage of dictionary

I'm trying to load a couple of files into memory. The files have one of the following 3 formats:

    string TAB int
    string TAB float
    int TAB float

Indeed, they are ngram statistics files, in case this helps with the solution. For…
Paul Hoang
  • 1,014
  • 2
  • 11
  • 21
43 votes, 1 answer

Understanding the `ngram_range` argument in a CountVectorizer in sklearn

I'm a little confused about how to use ngrams in the scikit-learn library in Python, specifically, how the ngram_range argument works in a CountVectorizer. Running this code: from sklearn.feature_extraction.text import CountVectorizer vocabulary…
tumultous_rooster
  • 12,150
  • 32
  • 92
  • 149
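`ngram_range=(min_n, max_n)` tells CountVectorizer to extract every n-gram for each n from min_n to max_n inclusive. A sketch of just that semantics, ignoring the vectorizer's lowercasing and token pattern (`extract_ngrams` is our own helper, not a scikit-learn function):

```python
def extract_ngrams(tokens, ngram_range):
    """Mimic what CountVectorizer(ngram_range=(lo, hi)) extracts:
    all n-grams for every n from lo to hi inclusive, joined by spaces."""
    lo, hi = ngram_range
    out = []
    for n in range(lo, hi + 1):
        out.extend(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return out

extract_ngrams("hello world again".split(), (1, 2))
# ['hello', 'world', 'again', 'hello world', 'world again']
```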
33 votes, 8 answers

Computing N Grams using Python

I needed to compute the Unigrams, BiGrams and Trigrams for a text file containing text like: "Cystic fibrosis affects 30,000 children and young adults in the US alone Inhaling the mists of salt water can reduce the pus and infection that fills the…
gran_profaci
  • 8,087
  • 15
  • 66
  • 99
32 votes, 3 answers

Filename search with ElasticSearch

I want to use ElasticSearch to search filenames (not the file's content). Therefore I need to find a part of the filename (exact match, no fuzzy search). Example: I have files with the following…
Biggie
  • 7,037
  • 10
  • 33
  • 42
32 votes, 7 answers

N-gram generation from a sentence

How to generate n-grams of a string like: String Input = "This is my car." I want to generate n-grams with this input: Input Ngram size = 3. Output should be: This, is, my, car, This is, is my, my car, This is my, is my car. Give some idea in Java, how to…
Preetam Purbia
  • 5,736
  • 3
  • 24
  • 26
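The question asks for Java, but the expected output is simply every n-gram for n = 1 up to the given size. A sketch in Python (this page's dominant language), with the period stripped as in the sample output; `all_ngrams` is our own name:

```python
def all_ngrams(sentence, max_n):
    """All n-grams for n = 1 .. max_n, as space-joined strings."""
    tokens = sentence.replace(".", "").split()
    return [" ".join(tokens[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(tokens) - n + 1)]

all_ngrams("This is my car.", 3)
# ['This', 'is', 'my', 'car', 'This is', 'is my', 'my car',
#  'This is my', 'is my car']
```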
31 votes, 4 answers

counting n-gram frequency in python nltk

I have the following code. I know that I can use apply_freq_filter function to filter out collocations that are less than a frequency count. However, I don't know how to get the frequencies of all the n-gram tuples (in my case bi-gram) in a…
Rkz
  • 1,237
  • 5
  • 16
  • 30
30 votes, 6 answers

Counting bigrams (pair of two words) in a file using Python

I want to count the number of occurrences of all bigrams (pair of adjacent words) in a file using python. Here, I am dealing with very large files, so I am looking for an efficient way. I tried using count method with regex "\w+\s\w+" on file…
swap310
  • 768
  • 2
  • 8
  • 22
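For large files, streaming line by line with `collections.Counter` avoids holding the whole text in memory. A sketch (bigrams do not cross line boundaries here, which may or may not match the question's intent):

```python
from collections import Counter

def count_bigrams(lines):
    """Stream over an iterable of lines (e.g. an open file object) and
    count adjacent word pairs without loading the file into memory."""
    counts = Counter()
    for line in lines:
        words = line.split()
        counts.update(zip(words, words[1:]))
    return counts

count_bigrams(["the cat sat on the mat", "the cat ran"])
```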
25 votes, 3 answers

Generate bigrams with NLTK

I am trying to produce a bigram list of a given sentence. For example, if I type "To be or not to be", I want the program to generate: to be, be or, or not, not to, to be. I tried the following code but it just gives me
Nikhil Raghavendra
  • 1,570
  • 5
  • 18
  • 25
25 votes, 4 answers

Python NLTK: Bigrams trigrams fourgrams

I have this example and I want to know how to get this result. I have text and I tokenize it, then I collect the bigrams, trigrams and fourgrams like that:

    import nltk
    from nltk import word_tokenize
    from nltk.util import ngrams
    text = "Hi How are…
M.A.Hassan
  • 500
  • 2
  • 7
  • 16
23 votes, 4 answers

Quick implementation of character n-grams for word

I wrote the following code for computing character bigrams and the output is right below. My question is, how do I get an output that excludes the last character (i.e. t)? And is there a quicker and more efficient method for computing character…
Tiger1
  • 1,327
  • 5
  • 19
  • 40
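Pairing each character with its successor via `zip` is the usual quick answer; the final character never starts a bigram, so it is naturally excluded (`char_bigrams` is our name for the helper):

```python
def char_bigrams(word):
    """Character bigrams via zip: each character paired with its
    successor, so the last character never starts a bigram."""
    return ["".join(pair) for pair in zip(word, word[1:])]

char_bigrams("text")
# ['te', 'ex', 'xt']
```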
20 votes, 4 answers

Find best substring match

I'm looking for a library or a method using existing libraries (difflib, fuzzywuzzy, python-levenshtein) to find the closest match of a string (query) in a text (corpus). I've developed a method based on difflib, where I split my corpus into ngrams…
Ghilas BELHADJ
  • 13,412
  • 10
  • 59
  • 99
20 votes, 3 answers

How to remove stopwords efficiently from a list of ngram tokens in R

Here's an appeal for a better way to do something that I can already do inefficiently: filter a series of n-gram tokens using "stop words" so that the occurrence of any stop word term in an n-gram triggers removal. I'd very much like to have one…
Ken Benoit
  • 14,454
  • 27
  • 50