Questions tagged [n-gram]

An N-gram is an ordered collection of N elements of the same kind, usually drawn from a large collection of many other similar N-grams. The individual elements are commonly natural language words, though N-grams have been applied to many other data types, such as numbers, letters, nucleotide sequences in DNA, etc. Statistical N-gram analysis is commonly performed as part of natural language processing, bioinformatics, and information theory.

N-grams may be derived for any positive integer N. 1-grams are called "unigrams," 2-grams are called "bigrams," 3-grams are called "trigrams," and higher-order N-grams are simply called by number, e.g. "4-grams." N-gram techniques may be applied to any kind of ordered data. Metadata such as end-of-sentence markers may or may not be included.

For example, using words as the elements and an N of 2, the English sentence "Three cows eat grass." could be broken into the 2-grams [{Three cows}, {cows eat}, {eat grass}, {grass #}], where # is a metadata marker denoting the end of the sentence.
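The example above can be sketched in Python; the helper name `word_ngrams` and the hand-appended `#` marker are illustrative, not a standard API:

```python
# Word-level 2-grams for the sentence above, with a hand-appended
# end-of-sentence marker "#" as in the example.
def word_ngrams(tokens, n):
    """Return all n-grams (as tuples) over an ordered token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "Three cows eat grass".split() + ["#"]
bigrams = word_ngrams(tokens, 2)
# [('Three', 'cows'), ('cows', 'eat'), ('eat', 'grass'), ('grass', '#')]
```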

As N-gram analysis embeds the data set into a vector space, it allows the application of many powerful statistical techniques to data for prediction, classification, and discernment of various properties.

More information:

  1. Google's Ngram Viewer
  2. Wikipedia article
874 questions
174 votes, 17 answers

n-grams in python, four, five, six grams?

I'm looking for a way to split a text into n-grams. Normally I would do something like:

    import nltk
    from nltk import bigrams
    string = "I really like python, it's pretty awesome."
    string_bigrams = bigrams(string)
    print string_bigrams

I am aware that…
Shifu
  • 2,115
  • 3
  • 17
  • 15
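For the question above, `nltk.util.ngrams(tokens, n)` accepts any n, so four-, five-, and six-grams fall out of the same call. A pure-standard-library sketch of the same idea (the helper name `ngrams` is ours):

```python
def ngrams(tokens, n):
    """Slide a window of width n over an ordered token list -- the same
    idea as nltk.util.ngrams, with no third-party dependency."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

words = "I really like python , it 's pretty awesome .".split()
fourgrams = ngrams(words, 4)   # works identically for n = 5, 6, ...
```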
68 votes, 3 answers

Elasticsearch: Find substring match

I want to perform both exact word match and partial word/substring match. For example, if I search for "men's shaver" then I should be able to find "men's shaver" in the result. But in case I search for "en's shaver" then also I should be able…
55 votes, 5 answers

Simple implementation of N-Gram, tf-idf and Cosine similarity in Python

I need to compare documents stored in a DB and come up with a similarity score between 0 and 1. The method I need to use has to be very simple. Implementing a vanilla version of n-grams (where it is possible to define how many grams to use), along…
seanieb
  • 1,196
  • 2
  • 14
  • 36
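A minimal, standard-library-only sketch of the pipeline the question describes (character n-grams → tf-idf → cosine similarity in [0, 1]); all names here are ours, and a real project would more likely reach for scikit-learn's TfidfVectorizer:

```python
import math
from collections import Counter

def char_ngrams(text, n=3):
    """Overlapping character n-grams of a string."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def tfidf_vectors(docs, n=3):
    """Map each document to a sparse {ngram: tf-idf weight} dict."""
    tfs = [Counter(char_ngrams(d, n)) for d in docs]
    df = Counter(g for tf in tfs for g in tf)        # document frequency
    N = len(docs)
    return [{g: tf[g] * math.log(N / df[g]) for g in tf} for tf in tfs]

def cosine(u, v):
    """Cosine similarity of two sparse non-negative vectors."""
    dot = sum(w * v.get(g, 0.0) for g, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0
```

With this, similar documents share weighted n-grams and score near 1, while unrelated documents score near 0.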
54 votes, 6 answers

Python: Reducing memory usage of dictionary

I'm trying to load a couple of files into memory. The files have one of the following 3 formats:

    string TAB int
    string TAB float
    int TAB float

Indeed, they are ngram statistics files, in case this helps with the solution. For…
Paul Hoang
  • 1,014
  • 2
  • 11
  • 21
43 votes, 1 answer

Understanding the `ngram_range` argument in a CountVectorizer in sklearn

I'm a little confused about how to use ngrams in the scikit-learn library in Python, specifically, how the ngram_range argument works in a CountVectorizer. Running this code: from sklearn.feature_extraction.text import CountVectorizer vocabulary…
tumultous_rooster
  • 12,150
  • 32
  • 92
  • 149
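`ngram_range=(min_n, max_n)` tells CountVectorizer to extract every n-gram for each n from min_n to max_n inclusive. A sketch of just that semantics, ignoring the vectorizer's lowercasing and token pattern (`extract_ngrams` is our own helper, not a scikit-learn function):

```python
def extract_ngrams(tokens, ngram_range):
    """Mimic what CountVectorizer(ngram_range=(lo, hi)) extracts:
    all n-grams for every n from lo to hi inclusive, joined by spaces."""
    lo, hi = ngram_range
    out = []
    for n in range(lo, hi + 1):
        out.extend(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return out

extract_ngrams("hello world again".split(), (1, 2))
# ['hello', 'world', 'again', 'hello world', 'world again']
```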
33 votes, 8 answers

Computing N Grams using Python

I needed to compute the Unigrams, BiGrams and Trigrams for a text file containing text like: "Cystic fibrosis affects 30,000 children and young adults in the US alone Inhaling the mists of salt water can reduce the pus and infection that fills the…
gran_profaci
  • 8,087
  • 15
  • 66
  • 99
32 votes, 3 answers

Filename search with ElasticSearch

I want to use ElasticSearch to search filenames (not the file's content). Therefore I need to find a part of the filename (exact match, no fuzzy search). Example: I have files with the following…
Biggie
  • 7,037
  • 10
  • 33
  • 42
32 votes, 7 answers

N-gram generation from a sentence

How to generate n-grams of a string like: String Input = "This is my car." I want to generate n-grams with this input: Input Ngram size = 3. Output should be: This, is, my, car, This is, is my, my car, This is my, is my car. Give some idea in Java, how to…
Preetam Purbia
  • 5,736
  • 3
  • 24
  • 26
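The question asks for Java, but the expected output is simply every n-gram for n = 1 up to the given size. A sketch in Python (this page's dominant language), with the period stripped as in the sample output; `all_ngrams` is our own name:

```python
def all_ngrams(sentence, max_n):
    """All n-grams for n = 1 .. max_n, as space-joined strings."""
    tokens = sentence.replace(".", "").split()
    return [" ".join(tokens[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(tokens) - n + 1)]

all_ngrams("This is my car.", 3)
# ['This', 'is', 'my', 'car', 'This is', 'is my', 'my car',
#  'This is my', 'is my car']
```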
31 votes, 4 answers

counting n-gram frequency in python nltk

I have the following code. I know that I can use apply_freq_filter function to filter out collocations that are less than a frequency count. However, I don't know how to get the frequencies of all the n-gram tuples (in my case bi-gram) in a…
Rkz
  • 1,237
  • 5
  • 16
  • 30
30 votes, 6 answers

Counting bigrams (pair of two words) in a file using Python

I want to count the number of occurrences of all bigrams (pair of adjacent words) in a file using python. Here, I am dealing with very large files, so I am looking for an efficient way. I tried using count method with regex "\w+\s\w+" on file…
swap310
  • 768
  • 2
  • 8
  • 22
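For large files, streaming line by line with `collections.Counter` avoids holding the whole text in memory. A sketch (bigrams do not cross line boundaries here, which may or may not match the question's intent):

```python
from collections import Counter

def count_bigrams(lines):
    """Stream over an iterable of lines (e.g. an open file object) and
    count adjacent word pairs without loading the file into memory."""
    counts = Counter()
    for line in lines:
        words = line.split()
        counts.update(zip(words, words[1:]))
    return counts

count_bigrams(["the cat sat on the mat", "the cat ran"])
```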
25 votes, 3 answers

Generate bigrams with NLTK

I am trying to produce a bigram list of a given sentence. For example, if I type "To be or not to be", I want the program to generate: to be, be or, or not, not to, to be. I tried the following code but it just gives me
Nikhil Raghavendra
  • 1,570
  • 5
  • 18
  • 25
25 votes, 4 answers

Python NLTK: Bigrams trigrams fourgrams

I have this example and I want to know how to get this result. I have text and I tokenize it, then I collect the bigrams, trigrams and fourgrams like that:

    import nltk
    from nltk import word_tokenize
    from nltk.util import ngrams
    text = "Hi How are…
M.A.Hassan
  • 500
  • 2
  • 7
  • 16
23 votes, 4 answers

Quick implementation of character n-grams for word

I wrote the following code for computing character bigrams and the output is right below. My question is, how do I get an output that excludes the last character (i.e. t)? And is there a quicker and more efficient method for computing character…
Tiger1
  • 1,327
  • 5
  • 19
  • 40
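Pairing each character with its successor via `zip` is the usual quick answer; the final character never starts a bigram, so it is naturally excluded (`char_bigrams` is our name for the helper):

```python
def char_bigrams(word):
    """Character bigrams via zip: each character paired with its
    successor, so the last character never starts a bigram."""
    return ["".join(pair) for pair in zip(word, word[1:])]

char_bigrams("text")
# ['te', 'ex', 'xt']
```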
20 votes, 4 answers

Find best substring match

I'm looking for a library or a method using existing libraries (difflib, fuzzywuzzy, python-levenshtein) to find the closest match of a string (query) in a text (corpus). I've developed a method based on difflib, where I split my corpus into ngrams…
Ghilas BELHADJ
  • 13,412
  • 10
  • 59
  • 99
20 votes, 3 answers

How to remove stopwords efficiently from a list of ngram tokens in R

Here's an appeal for a better way to do something that I can already do inefficiently: filter a series of n-gram tokens using "stop words" so that the occurrence of any stop word term in an n-gram triggers removal. I'd very much like to have one…
Ken Benoit
  • 14,454
  • 27
  • 50