
For example, we have the following text:

"Spark is a framework for writing fast, distributed programs. Spark solves similar problems as Hadoop MapReduce does but with a fast in-memory approach and a clean functional style API. ..."

I need all possible sections of this text, in order: one word at a time, then two at a time, three at a time, up to five at a time, like this:

ones : ['Spark', 'is', 'a', 'framework', 'for', 'writing', 'fast', 'distributed', 'programs', ...]

twos : ['Spark is', 'is a', 'a framework', 'framework for', 'for writing' ...]

threes : ['Spark is a', 'is a framework', 'a framework for', 'framework for writing', 'for writing fast', ...]

. . .

fives : ['Spark is a framework for', 'is a framework for writing', 'a framework for writing fast','framework for writing fast distributed', ...]

Please note that the text to be processed is huge (about 100 GB). I need the best solution for this process. Maybe it should be processed in parallel with multiple threads.

I don't need the whole list at once; it can be streamed.

Arash Mousavi

5 Answers


First of all, make sure that your file has lines; then you can read it line by line without worry (discussed here):

with open('my100GBfile.txt') as corpus:
    for line in corpus:
        sequence = preprocess(line)
        extract_n_grams(sequence)

Let's assume that your corpus doesn't need any special treatment. You can find a suitable treatment for your text; here I only want it to be chunked into the desired tokens:

def preprocess(string):
    # do whatever preprocessing needs to be done
    # e.g. convert to lowercase: string = string.lower()
    # return the sequence of tokens
    return string.split()

I don't know what you want to do with the n-grams. Let's assume that you want to count them for a language model that fits in your memory (it usually does, but I'm not sure about 4- and 5-grams). The easy way is to use the off-the-shelf nltk library:

from nltk.util import ngrams

lm = {n:dict() for n in range(1,6)}
def extract_n_grams(sequence):
    for n in range(1,6):
        ngram = ngrams(sequence, n)
        # now you have an n-gram; do whatever you want with it
        # yield ngram
        # you can count them for your language model?
        for item in ngram:
            lm[n][item] = lm[n].get(item, 0) + 1

Mehdi
    It should be pointed out that this does not produce n-grams crossing lines. – Joachim Wagner Nov 30 '22 at 14:17
  • Good point :) It is possible to keep a cache of the last n-1 tokens within the line-reading loop. That way, they can be passed on to the following line as natural left-padding of the sequence, while n-grams stop at the end of each sequence without right-padding. – Mehdi Feb 27 '23 at 06:53
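
A minimal sketch of that carry-over idea, reusing preprocess and lm from the answer above (the history buffer and MAX_N are names introduced here, not part of the original answer):

from collections import deque

MAX_N = 5
history = deque(maxlen=MAX_N - 1)  # last MAX_N - 1 tokens of the previous lines

with open('my100GBfile.txt') as corpus:
    for line in corpus:
        tokens = preprocess(line)
        if not tokens:
            continue
        # prepend the tail of the previous lines so n-grams can cross line breaks
        extended = list(history) + tokens
        for n in range(1, MAX_N + 1):
            # only count n-grams whose last token lies in the current line,
            # so nothing is counted twice across iterations
            start = max(0, len(history) - (n - 1))
            for i in range(start, len(extended) - n + 1):
                item = tuple(extended[i:i + n])
                lm[n][item] = lm[n].get(item, 0) + 1
        history.extend(tokens)  # the deque keeps only the last MAX_N - 1 tokens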

If you don't need the whole list at once, then the best choice should be to use iterators. Thus, my solution looks like this:

import re
text = "Spark is a framework for writing fast, distributed programs. Spark solves similar problems as Hadoop MapReduce does but with a fast in-memory approach and a clean functional style API."
word_re = re.compile(r"\w+")
words = [text[word.start():word.end()] for word in word_re.finditer(text)]
ngrams = ((words[k] for k in range(j, j + i + 1)) for i in range(len(words)) for j in range(len(words) - i))
for ngram in ngrams:
    print(' '.join(ngram))

This gives you all the needed n-grams in the desired order. Note that iterators are unavoidable even for the n-grams themselves: with a text of about 100 GB, the majority of your "all possible sections" will not fit into memory.

Note also that in your case you will need to get the words (and the word count) from the text in a streaming way, as you will not be able to hold them all in memory like in my code sample.
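
One way to do that is to stream the words from the file and keep only a small sliding window in memory; a rough sketch (the file name and window size are just placeholders):

import re
from collections import deque

word_re = re.compile(r"\w+")

def stream_words(path):
    """Yield words one at a time without loading the whole file."""
    with open(path) as corpus:
        for line in corpus:
            for match in word_re.finditer(line):
                yield match.group()

def stream_ngrams(path, n):
    """Yield n-grams as tuples, using a sliding window of size n."""
    window = deque(maxlen=n)
    for word in stream_words(path):
        window.append(word)
        if len(window) == n:
            yield tuple(window)

# usage: print the first few 3-grams of the corpus
for i, gram in enumerate(stream_ngrams('my100GBfile.txt', 3)):
    print(' '.join(gram))
    if i >= 4:
        break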

distort86

This should ideally do it. You can tune the min_len and max_len parameters to suit your needs. Applying a sort (for example by frequency) can also give you a good idea of which n-grams stand out from the others.

import nltk
from nltk.util import everygrams

content = "Spark is a framework for writing fast, distributed programs. Spark solves similar problems as Hadoop MapReduce does but with a fast in-memory approach and a clean functional style API. ..."
tokens = nltk.word_tokenize(content)
# adjust min_len/max_len to the orders you need (1-grams up to 5-grams here)
bgs = everygrams(tokens, min_len=1, max_len=5)
fdist_bg = nltk.FreqDist(bgs)
for k, v in fdist_bg.items():
    print(k, v)

And since you mention parallel execution, you can check out the following code snippet using Spark MLlib and Python:

from pyspark.ml.feature import NGram

# assumes a SQLContext/SparkSession is already available, e.g. in the pyspark shell
wordDataFrame = sqlContext.createDataFrame([
    (0, ["Hi", "I", "heard", "about", "Spark"]),
    (1, ["I", "wish", "Java", "could", "use", "case", "classes"]),
    (2, ["Logistic", "regression", "models", "are", "neat"])
], ["label", "words"])

ngram = NGram(inputCol="words", outputCol="ngrams")
ngramDataFrame = ngram.transform(wordDataFrame)
for ngrams_label in ngramDataFrame.select("ngrams", "label").take(3):
    print(ngrams_label)
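
NGram produces bigrams by default; to cover the 1- through 5-grams from the question, one option is to run the transformer once per order (the grams_n column names below are just examples, not part of the MLlib API):

for n in range(1, 6):
    ngram_n = NGram(n=n, inputCol="words", outputCol="grams_{0}".format(n))
    wordDataFrame = ngram_n.transform(wordDataFrame)

wordDataFrame.select("grams_1", "grams_2", "grams_5").show(truncate=False)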

A link to this example and other feature extraction techniques in Spark is here: Spark MLlib Feature Extraction

Hope it helps. Cheers. :)

shanky_thebearer

I've written a C library that does this: https://github.com/adriaan-pelzer/corpusToNgrams

Essentially, the most performant way I could think of doing it was this:

  • Parse through the corpus, character by character.
  • When you encounter a word-end character (space, comma, full stop, etc.), add the completed word to a 1-gram array.
  • For n = 2 to maxN: whenever the 1-gram array holds at least n words, concatenate its last n words and add the result to the n-gram array.

The above can be implemented in a recursive function and requires only a single pass over the corpus.
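
The library itself is in C, but the idea translates directly. A minimal Python sketch of the same single-pass, rolling-window approach (the separator set and maximum order are assumptions, and it yields the n-grams instead of storing them, to keep memory bounded):

def ngrams_single_pass(path, max_n=5):
    """Parse the corpus character by character, yielding (n, ngram) tuples."""
    words = []    # rolling buffer of the last max_n words (the "1-gram array")
    current = []  # characters of the word currently being read
    with open(path) as corpus:
        while True:
            ch = corpus.read(1)
            if ch and ch not in ' \t\n.,;:!?':  # still inside a word
                current.append(ch)
            else:                               # word boundary or end of file
                if current:
                    words.append(''.join(current))
                    current = []
                    words = words[-max_n:]      # keep only the last max_n words
                    for n in range(1, len(words) + 1):
                        yield n, tuple(words[-n:])  # n-gram ending at this word
                if not ch:                      # end of file
                    break

for n, gram in ngrams_single_pass('my100GBfile.txt'):
    pass  # count or write out the n-gram here

Reading one character at a time in pure Python will be slow compared to the C version, but the control flow is the same.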

Adriaan Pelzer

Have you written any code? Try googling N-gram generation or looking here: Computing N Grams using Python

Looks like you want to generate 1-grams (a list of the words), up through 5-grams.

I would do each n in a single pass, then move on to the (n+1)-grams.
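
A rough sketch of that pass-per-order idea (the file name is just a placeholder); each call scans the corpus once and yields only the n-grams of that order:

def one_pass(path, n):
    """One full pass over the corpus, yielding only n-grams of order n."""
    window = []
    with open(path) as corpus:
        for line in corpus:
            for word in line.split():
                window.append(word)
                window = window[-n:]  # keep the last n words only
                if len(window) == n:
                    yield ' '.join(window)

# five separate passes: 1-grams first, then 2-grams, and so on up to 5-grams
for n in range(1, 6):
    for gram in one_pass('my100GBfile.txt', n):
        pass  # write out or count the n-gram here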

Will
  • This should be a comment rather than an answer; the answers in the linked question are also not going to work for data this size, as at some stage the slices would be 100 GB. – Padraic Cunningham Jun 05 '15 at 22:27