Extracting sentences including a word from large corpus, including the punctuation, in python

Question

I am working with a big corpus (~30 GB) and I need to extract the sentences that contain any word from a list (~5000 words), including the punctuation. I'm using a regex approach, but I'm open to any suggestions regarding the efficiency of the method. The following code extracts the sentences containing 'anarchism', but without the punctuation (obtained from here).

import re

# f_path is the path to the corpus file
with open(f_path, 'r') as f_in:
    for line in f_in:
        sentences = re.findall(r'([^.!?]*anarchism[^.!?]*)', line)

Input:

anarchism, is good. anarchism? anarchism!

Actual return:

['anarchism, is good', ' anarchism', ' anarchism']

Expected return:

['anarchism, is good.', 'anarchism?', 'anarchism!']

Any suggestions?

score 1 · Answer 1 · answered Apr 04 '20 at 00:12

With [^.!?]* at the end of your pattern, you're explicitly excluding any punctuation. If you're certain that your sentence ends in exactly one of [.!?], you could just add that to the pattern:

>>> import re
>>> line = "anarchism, is good. anarchism? anarchism!"
>>> re.findall(r'([^.!?]*anarchism[^.!?]*[.!?])', line)
['anarchism, is good.', ' anarchism?', ' anarchism!']
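Note that the matches above still carry the leading space of each sentence. A minimal sketch of how the same pattern could be extended to a keyword list and stripped to match the expected output; the helper name `sentences_with_keywords` is hypothetical, and it still assumes every sentence ends in exactly one of [.!?]. With ~5000 keywords the alternation may become slow, so treat this as an illustration rather than a tuned solution.

```python
import re

def sentences_with_keywords(line, keywords):
    # Hypothetical helper: one alternation over all (escaped) keywords,
    # with \b so that only whole words match.
    alternation = "|".join(re.escape(k) for k in keywords)
    pattern = re.compile(
        r"[^.!?]*\b(?:" + alternation + r")\b[^.!?]*[.!?]"
    )
    # No capturing groups, so findall returns the full matches;
    # strip() removes the space left over from the previous sentence.
    return [match.strip() for match in pattern.findall(line)]

line = "anarchism, is good. anarchism? anarchism!"
print(sentences_with_keywords(line, ["anarchism"]))
# ['anarchism, is good.', 'anarchism?', 'anarchism!']
```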

score 1 · Accepted Answer · answered Apr 04 '20 at 00:32

Your pattern will split sentences in places you probably don't like; for example, "Mr. Tamblay" (because of the period). You can use a sentence tokenizer from nltk for a more sophisticated split. To actually check if any of your words is in the sentence, you can of course filter over the sentence tokens.

import nltk
sentence_tokenizer = nltk.tokenize.punkt.PunktSentenceTokenizer()
...
for line in f_in:
    for start, end in sentence_tokenizer.span_tokenize(line):
        sentence = line[start:end]
        for keyword in keywords:
            if keyword in sentence:
                do_something()
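The inner loop above does one substring test per keyword, so roughly 5000 tests per sentence. If the keywords are single words, one possible shortcut is to split the sentence into words once and intersect them with a set. This is only a sketch under that assumption: `matching_keywords` and `WORD_RE` are hypothetical names, the keywords are assumed to be lowercase, and unlike `keyword in sentence` it matches whole words only.

```python
import re

# Assumption: a "word" is a run of letters, digits or underscores.
WORD_RE = re.compile(r"\w+")

def matching_keywords(sentence, keyword_set):
    # Lowercase every word once, then intersect with the keyword set;
    # the cost depends on the sentence length, not on the keyword count.
    words = {word.lower() for word in WORD_RE.findall(sentence)}
    return words & keyword_set

keywords = {"anarchism", "state"}
print(matching_keywords("Anarchism, is good.", keywords))  # {'anarchism'}
```

Inside the loop, `if matching_keywords(sentence, keywords):` could then replace the `for keyword in keywords` block.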

If basic iterations over all the keywords are too slow, you can explore options to search the sentence for all strings at once using the Aho-Corasick algorithm.
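To make the idea concrete, here is a small pure-Python sketch of Aho-Corasick. It is not taken from any library, and the names `build_automaton` and `find_keywords` are hypothetical. It finds keywords as substrings, like `keyword in sentence`, but scans the sentence once regardless of how many keywords there are. For a 30 GB corpus a compiled implementation would probably be much faster.

```python
from collections import deque

def build_automaton(keywords):
    # Trie stored as parallel lists: transitions, failure links, outputs.
    goto, fail, out = [{}], [0], [[]]
    for keyword in keywords:
        node = 0
        for ch in keyword:
            if ch not in goto[node]:
                goto[node][ch] = len(goto)
                goto.append({})
                fail.append(0)
                out.append([])
            node = goto[node][ch]
        out[node].append(keyword)
    # Breadth-first pass: failure links of depth-1 nodes stay at the root.
    queue = deque(goto[0].values())
    while queue:
        node = queue.popleft()
        for ch, child in goto[node].items():
            f = fail[node]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[child] = goto[f].get(ch, 0)
            # Inherit keywords that end at the failure target.
            out[child] = out[child] + out[fail[child]]
            queue.append(child)
    return goto, fail, out

def find_keywords(text, automaton):
    goto, fail, out = automaton
    node, found = 0, set()
    for ch in text:
        # Follow failure links until a transition on ch exists (or root).
        while node and ch not in goto[node]:
            node = fail[node]
        node = goto[node].get(ch, 0)
        found.update(out[node])
    return found

automaton = build_automaton(["he", "she", "his", "hers"])
print(sorted(find_keywords("ushers", automaton)))  # ['he', 'hers', 'she']
```

The automaton is built once for the whole keyword list and then reused for every sentence, for example `if find_keywords(sentence, automaton):`.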
