I am working with a large corpus (~30 GB) and need to extract every sentence that contains any word from a list (~5000 words), keeping the punctuation. I'm using a regex approach, but I'm open to any suggestions regarding the efficiency of the method. The following code, obtained from here, extracts the sentences containing 'anarchism', but without the punctuation.
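import re

with open(f_path, 'r') as f_in:
    for line in f_in:
        # The character classes exclude '.', '!' and '?', so the terminator is never captured
        sentences = re.findall(r'([^.!?]*anarchism[^.!?]*)', line)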
Input:
anarchism, is good. anarchism? anarchism!
Actual return:
['anarchism, is good', ' anarchism', ' anarchism']
Expected return:
['anarchism, is good.', 'anarchism?', 'anarchism!']
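To make the target concrete, here is a minimal sketch that reproduces the expected return for this sample. It is only one possible direction, not a benchmarked solution: the helper name `sentences_with` is hypothetical, it assumes every sentence ends in '.', '!' or '?' (so abbreviations such as "e.g." would be split, and a trailing sentence with no terminator would be skipped), and it simply joins the word list into a single alternation, which I have not tested at the scale of ~5000 words or ~30 GB.

```python
import re

def sentences_with(words, text):
    """Return the sentences of `text` containing any of `words`, with end punctuation."""
    # Escape each word so it is matched literally, then build one alternation.
    alternation = "|".join(re.escape(w) for w in words)
    # Same idea as the original pattern, plus a required terminator [.!?] at the end.
    pattern = re.compile(r"[^.!?]*\b(?:" + alternation + r")\b[^.!?]*[.!?]")
    # No capturing groups, so findall returns whole matches; strip the leading space.
    return [m.strip() for m in pattern.findall(text)]

print(sentences_with(["anarchism"], "anarchism, is good. anarchism? anarchism!"))
# → ['anarchism, is good.', 'anarchism?', 'anarchism!']
```

The pattern is compiled once, so in the real loop it could be reused for every line of the file rather than rebuilt on each call.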
Any suggestions?