I have a large corpus from which I want to remove certain words. This is similar to removing stopwords, except that I now want to remove bigrams. I have my list of bigrams, but the simple list-comprehension approach used for stopwords won't work here. I was thinking of using regex: compile a pattern from the list of bigrams and then substitute the matches. Here is some sample code:

txt = 'He was the type of guy who liked Christmas lights on his house in the middle of July. He picked up trash in his spare time to dump in his neighbors yard. If eating three-egg omelets causes weight-gain, budgie eggs are a good substitute. We should play with legos at camp. She cried diamonds. She had some amazing news to share but nobody to share it with. He decided water-skiing on a frozen lake wasn’t a good idea. His eyes met mine on the street. When he asked her favorite number, she answered without hesitation that it was diamonds. She is never happy until she finds something to be unhappy about; then, she is overjoyed.'

--

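import re

words_to_remove = ['this is', 'We should', 'Christmas lights']
# Escape each bigram and anchor on word boundaries; joining with ' | ' would
# bake stray spaces into the alternatives (e.g. 'this is ' and ' We should ').
pattrn = re.compile(r'\b(?:' + '|'.join(map(re.escape, words_to_remove)) + r')\b')
pattrn.sub(' ', txt)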

%timeit pattrn.sub(' ',txt)

--

timeit 1: 9.18 µs ± 11.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

Is there a faster way to remove these bigrams? The actual corpus is 556,694,135 characters long and the list contains 3,205,182 bigrams, so this approach is really slow on the real dataset.

dawg
Kevin

1 Answer

You can restructure your regex as a trie (for example, w(or(d|se)|ild) instead of word|worse|wild) or, even better, drop the regex and use the Aho–Corasick algorithm. You don't have to implement it yourself: a library such as FlashText (a slimmed-down version of Aho–Corasick, specialized for searching and replacing whole words, as in your case) does this for you.

The author of FlashText claims »Regex was taking 5 days to run. So I built a tool that did it in 15 minutes.«
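
To illustrate the trie-shaped regex mentioned above, here is a minimal sketch in plain Python. The helper names `build_trie` and `trie_to_regex` are made up for this example and don't come from any library, and the sketch assumes every bigram starts and ends with a word character so that `\b` anchors behave as expected. I haven't measured it against your corpus, so treat it as a starting point rather than a guaranteed speed-up:

```python
import re

def build_trie(phrases):
    """Nest the phrases character by character; the '' key marks the end of a phrase."""
    root = {}
    for phrase in phrases:
        node = root
        for ch in phrase:
            node = node.setdefault(ch, {})
        node[''] = True
    return root

def trie_to_regex(node):
    """Return a regex fragment matching every suffix stored below this trie node."""
    is_end = '' in node
    branches = [re.escape(ch) + trie_to_regex(node[ch])
                for ch in sorted(node) if ch]
    if not branches:
        return ''
    if len(branches) == 1 and not is_end:
        return branches[0]
    group = '(?:' + '|'.join(branches) + ')'
    # If a phrase ends here but longer phrases continue, the rest is optional.
    return group + '?' if is_end else group

# The example from the answer: shared prefixes are factored out.
demo = trie_to_regex(build_trie(['word', 'worse', 'wild']))

# Applied to bigrams, anchored on word boundaries.
words_to_remove = ['this is', 'We should', 'Christmas lights']
pattern = re.compile(r'\b' + trie_to_regex(build_trie(words_to_remove)) + r'\b')
sample = 'We should play with legos at camp. He liked Christmas lights on his house.'
cleaned = pattern.sub(' ', sample)
```

The point of the trie shape is that the regex engine no longer retries every alternative from scratch at each position; alternatives that share a prefix are checked once. With millions of bigrams the compiled pattern still gets very large, which is one reason an Aho–Corasick-style tool may be the better fit.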

Socowi