
I want to implement some applications with n-grams (preferably in PHP).


Which type of n-gram is more suitable for most purposes: word-level or character-level n-grams? And how could you implement an n-gram tokenizer in PHP?


First, I would like to know what n-grams exactly are. Is the following correct? This is how I understand n-grams:

Sentence: "I live in NY."

word-level bigrams (n = 2): "# I", "I live", "live in", "in NY", "NY #"

character-level bigrams (n = 2): "#I", "I#", "#l", "li", "iv", "ve", "e#", "#i", "in", "n#", "#N", "NY", "Y#"

When you have this array of n-gram parts, you drop the duplicates and add a counter for each part giving its frequency:

word-level bigrams: [1, 1, 1, 1, 1]

character-level bigrams: [2, 1, 1, ...]

Is this correct?
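
To make sure I understand the counting step, here is a rough, untested sketch of how I would count word-level bigram frequencies in PHP (the "#" boundary marker and the whitespace tokenization are just my assumptions):

    <?php
    // Count word-level bigram frequencies; "#" marks the sentence boundaries.
    function wordBigramFrequencies($sentence) {
        $words = preg_split('/\s+/', trim($sentence));          // naive whitespace split
        $padded = array_merge(array('#'), $words, array('#'));  // boundary markers
        $frequencies = array();
        for ($i = 0; $i < count($padded) - 1; $i++) {
            $bigram = $padded[$i] . ' ' . $padded[$i + 1];
            if (!isset($frequencies[$bigram])) { $frequencies[$bigram] = 0; }
            $frequencies[$bigram]++;
        }
        return $frequencies;
    }

    print_r(wordBigramFrequencies("I live in NY."));
    // ["# I" => 1, "I live" => 1, "live in" => 1, "in NY." => 1, "NY. #" => 1]
    // (the period sticks to "NY." because of the naive tokenization)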


Furthermore, I would like to learn more about what you can do with n-grams:

  • How can I identify the language of a text using n-grams?
  • Is it possible to do machine translation using n-grams even if you don't have a bilingual corpus?
  • How can I build a spam filter (spam, ham)? Should I combine n-grams with a Bayesian filter?
  • How can I do topic spotting? For example: Is a text about basketball or dogs? My approach (do the following with the Wikipedia articles for "dogs" and "basketball"): build the n-gram vectors for both documents, normalize them, calculate the Manhattan/Euclidean distance; the closer the result is to 1, the higher the similarity (see the rough sketch after this list)
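
To make the last idea concrete, here is a rough, untested sketch of the comparison I have in mind; the normalization and the distance function are just my assumptions, and the input variables are hypothetical, already-aligned n-gram count vectors:

    <?php
    // Normalize two aligned n-gram count vectors and compute their Euclidean distance.
    function normalize(array $vector) {
        $sum = array_sum($vector);
        if ($sum == 0) { return $vector; }
        $normalized = array();
        foreach ($vector as $value) { $normalized[] = $value / $sum; }
        return $normalized;
    }

    function euclideanDistance(array $a, array $b) {
        $sumOfSquares = 0.0;
        foreach ($a as $i => $value) {
            $diff = $value - $b[$i];
            $sumOfSquares += $diff * $diff;
        }
        return sqrt($sumOfSquares);
    }

    // $dogsVector and $basketballVector would be the aligned n-gram vectors
    // built from the two Wikipedia articles (hypothetical variables).
    echo euclideanDistance(normalize($dogsVector), normalize($basketballVector));
    // I'm not sure yet how to interpret the resulting value -- that's part of my question.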

What do you think about my application approaches, especially the last one?


I hope you can help me. Thanks in advance!


2 Answers


Word n-grams will generally be more useful for most of the text analysis applications you mention, with the possible exception of language detection, where something like character trigrams might give better results. Effectively, you would create an n-gram vector for a corpus of text in each language you are interested in detecting and then compare the trigram frequencies in each corpus to those in the document you are classifying. For example, the trigram "the" probably appears much more frequently in English than in German and would provide some level of statistical correlation. Once you have your documents in n-gram format, you have a choice of many algorithms for further analysis: Bayesian filters, k-nearest neighbors, support vector machines, etc.
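
A minimal sketch of this idea in PHP, assuming a reference corpus per language is already loaded into the hypothetical variables $englishCorpus, $germanCorpus and $document; cosine similarity is just one possible way to compare the profiles, and the code is byte-based, so it is not multibyte-safe:

    <?php
    // Build a character-trigram frequency profile for a text.
    function trigramProfile($text) {
        $text = strtolower($text);
        $profile = array();
        for ($i = 0, $len = strlen($text); $i <= $len - 3; $i++) {
            $trigram = substr($text, $i, 3);
            if (!isset($profile[$trigram])) { $profile[$trigram] = 0; }
            $profile[$trigram]++;
        }
        return $profile;
    }

    // Cosine similarity between two sparse frequency profiles.
    function cosineSimilarity(array $a, array $b) {
        $dot = 0.0; $normA = 0.0; $normB = 0.0;
        foreach ($a as $gram => $count) {
            $normA += $count * $count;
            if (isset($b[$gram])) { $dot += $count * $b[$gram]; }
        }
        foreach ($b as $count) { $normB += $count * $count; }
        return ($normA && $normB) ? $dot / (sqrt($normA) * sqrt($normB)) : 0.0;
    }

    // Pick the language whose corpus profile is most similar to the document.
    $profiles = array(
        'en' => trigramProfile($englishCorpus),
        'de' => trigramProfile($germanCorpus),
    );
    $docProfile = trigramProfile($document);
    $bestLang = null; $bestScore = -1;
    foreach ($profiles as $lang => $profile) {
        $score = cosineSimilarity($docProfile, $profile);
        if ($score > $bestScore) { $bestScore = $score; $bestLang = $lang; }
    }
    echo "Most likely language: $bestLang\n";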

Of the applications you mention, machine translation is probably the most far-fetched, as n-grams alone will not bring you very far down the path. Converting an input file to an n-gram representation is just a way to put the data into a format for further feature analysis, but since you lose a lot of contextual information, it may not be useful for translation.

One thing to watch out for is that it isn't enough to create a vector [1,1,1,2,1] for one document and a vector [2,1,2,4] for another document if the dimensions don't match. That is, the first entry in the vector cannot be "the" in one document and "is" in another, or the algorithms won't work. You will wind up with vectors like [0,0,0,0,1,1,0,0,2,0,0,1], as most documents will not contain most of the n-grams you are interested in. This 'lining up' of features is essential, and it requires you to decide 'in advance' which n-grams you will be including in your analysis. Often this is implemented as a two-pass algorithm: a first pass determines the statistical significance of the various n-grams and decides which ones to keep. Google 'feature selection' for more information.
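
A sketch of this 'lining up' step, under the assumption that per-document n-gram counts are already available as associative arrays; the minimum-count cutoff here only stands in for a real feature-selection pass:

    <?php
    // First pass: pool counts over all documents and keep only n-grams that
    // occur at least $minCount times, producing one fixed, ordered feature list.
    function selectFeatures(array $allDocCounts, $minCount = 2) {
        $totals = array();
        foreach ($allDocCounts as $docCounts) {
            foreach ($docCounts as $ngram => $count) {
                if (!isset($totals[$ngram])) { $totals[$ngram] = 0; }
                $totals[$ngram] += $count;
            }
        }
        $features = array();
        foreach ($totals as $ngram => $total) {
            if ($total >= $minCount) { $features[] = $ngram; }
        }
        return $features;   // the same ordered list is used for every document
    }

    // Second pass: map each document onto a vector of identical length,
    // with zeros for n-grams the document does not contain.
    function alignedVector(array $features, array $docCounts) {
        $vector = array();
        foreach ($features as $ngram) {
            $vector[] = isset($docCounts[$ngram]) ? $docCounts[$ngram] : 0;
        }
        return $vector;
    }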

Word-based n-grams plus support vector machines are an excellent way to perform topic spotting, but you need a large corpus of text pre-classified into 'on topic' and 'off topic' to train the classifier. You will find a large number of research papers explaining various approaches to this problem on a site like CiteSeerX. I would not recommend the Euclidean distance approach to this problem, as it does not weight individual n-grams based on statistical significance, so two documents that both include "the", "a", "is", and "of" would be considered a better match than two documents that both included "Bayesian". Removing stop-words from your n-grams of interest would improve this somewhat.
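
A small sketch of the stop-word suggestion, assuming word n-gram counts keyed by the n-gram string; the stop-word list is a tiny illustrative sample, not a real one:

    <?php
    // Drop word n-grams that consist only of stop-words before computing
    // any similarity or distance, so features like "of the" don't dominate.
    $stopWords = array('the', 'a', 'is', 'of', 'and', 'in', 'to');

    function dropStopWordNgrams(array $ngramCounts, array $stopWords) {
        $kept = array();
        foreach ($ngramCounts as $ngram => $count) {
            $words = explode(' ', strtolower($ngram));
            if (count(array_diff($words, $stopWords)) > 0) {   // keep if any content word remains
                $kept[$ngram] = $count;
            }
        }
        return $kept;
    }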

– bdk
  • Thank you very much for this detailed answer! I still have one last question: What is the advantage of n-grams for the vectors over simple words for the vectors? I mean: Why should you split "I live in NY" into "I live, live in, in NY" instead of simply "I, live, in, NY"? – caw Jun 23 '09 at 13:23
  • Using words as features is equivalent to word-based n-grams with n=1. The advantage of increasing n is that you get increased context in your features. For example, knowing that two documents both include the n-gram "The Who" might be more useful than knowing that they both include "The" and "Who" separately. – bdk Jun 23 '09 at 13:28
  • Also, note that if constructing vectors for n=3 n-grams (for example), I would include the n=2 and n=1 n-grams as well. I'm not sure if this is canonical, but projects I've worked on in the past have often done this. The advantage of increased n is additional context, but the disadvantage is smaller sample sets (any given 3-word phrase won't appear as often in a corpus as a 2-word phrase). Including n=1, 2, 3 n-grams together gives you the best of both worlds, with the downside of additional storage and computation demands. – bdk Jun 23 '09 at 13:34
  • This is one of the best answers I've seen on any topic on SO. – glomad Jun 23 '09 at 16:56
  • here's another post that might help http://stackoverflow.com/questions/21656861/bytes-vs-characters-vs-words-which-granularity-for-n-grams/21660459#21660459 – Mark Giaconia Jun 07 '14 at 15:04

You are correct about the definition of n-grams.

You can use word-level n-grams for search-type applications. Character-level n-grams can be used more for analysis of the text itself. For example, to identify the language of a text, I would use the frequencies of the letters compared to the established frequencies of the language. That is, the text should roughly match the frequency of occurrence of letters in that language.

An n-gram tokenizer for words in PHP can be done using strtok:

http://us2.php.net/manual/en/function.strtok.php

For characters, use str_split:

http://us2.php.net/manual/en/function.str-split.php

Then you can just split the resulting array into n-grams of whatever length you like.
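
A possible sketch along these lines, using strtok for word tokens and str_split for character tokens, then a sliding window over the resulting array (the delimiter set is only an example):

    <?php
    // Turn a token array into n-grams with a sliding window.
    function slidingWindow(array $tokens, $n, $glue = ' ') {
        $ngrams = array();
        for ($i = 0; $i <= count($tokens) - $n; $i++) {
            $ngrams[] = implode($glue, array_slice($tokens, $i, $n));
        }
        return $ngrams;
    }

    // Word-level n-grams via strtok.
    function wordNgrams($text, $n = 2) {
        $tokens = array();
        $tok = strtok($text, " \t\n.,;:!?()");
        while ($tok !== false) {
            $tokens[] = $tok;
            $tok = strtok(" \t\n.,;:!?()");
        }
        return slidingWindow($tokens, $n);
    }

    // Character-level n-grams via str_split (byte-based, not multibyte-safe).
    function charNgrams($text, $n = 2) {
        return slidingWindow(str_split($text), $n, '');
    }

    print_r(wordNgrams("I live in NY.", 2));   // ["I live", "live in", "in NY"]
    print_r(charNgrams("NY", 2));              // ["NY"]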

Bayesian filters need to be trained for use as spam filters, and they can be used in combination with n-grams. However, you need to give them plenty of input in order for them to learn.
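
A very rough sketch of how that combination could look, assuming n-gram counts from labelled spam and ham training corpora are already available ($spamCounts, $hamCounts, $vocabSize and $messageNgrams are hypothetical):

    <?php
    // Naive Bayes log-score of a message's n-grams under one class,
    // with add-one smoothing so unseen n-grams don't zero out the score.
    function logScore(array $messageNgrams, array $classCounts, $classTotal, $vocabSize, $classPrior) {
        $score = log($classPrior);
        foreach ($messageNgrams as $gram) {
            $count = isset($classCounts[$gram]) ? $classCounts[$gram] : 0;
            $score += log(($count + 1) / ($classTotal + $vocabSize));
        }
        return $score;
    }

    $spamScore = logScore($messageNgrams, $spamCounts, array_sum($spamCounts), $vocabSize, 0.5);
    $hamScore  = logScore($messageNgrams, $hamCounts,  array_sum($hamCounts),  $vocabSize, 0.5);
    echo ($spamScore > $hamScore) ? "spam\n" : "ham\n";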

Your last approach sounds decent for learning the context of a page. This is still fairly difficult to do, but n-grams sound like a good starting point for doing so.

– AlbertoPL
  • Thank you. I think that strtok is too simple for good tokenization since you would have to add lots of tokens like: space, comma, dot, underscore, brackets etc. But the first paragraph, the use cases, is really helpful. Thanks! :) – caw Jun 23 '09 at 17:01