
So I'm trying to match a regular expression against paragraphs in order to do sentiment analysis, but tqdm estimates this could take about 300 hours. I was wondering if anyone has a critique of what I could do to improve the way my RE runs.

I'm trying to match the stem endings of negative words for this analysis. Here is a small snippet of the expression used for the match. I'm only showing a small snippet because the entire expression contains about 2800 terms and is structured the same way all the way through, hence the ellipsis.

regex_neg = "((a lie)|(abandon)|(abas)|(abattoir)|(abdicat)|(aberra)|(abhor)|(abject)|(abnormal)|(abolish)|(abominab)|(abominat)|(abrasiv)|(absent)|(abstrus)|(absurd)|(abus)|(accident)|(accost)|(accursed)|(accusation)|(accuse)|(accusing)|(acerbi)|(ache)|(aching)|(achy)|(acomia)|(acrimon)|(adactylism)|(addict)|(admonish)|(admonition)|(adulterat)|(adultery)|(advers)|(affectation)|(affected)|(affected manner)|(afflict)|(affright)...)"

Here is the function that I'm using to match the stems in the paragraphs:

import json
import re

def neg_stems(paragraph):
    """Collect every negative stem matched in the joined text."""
    stem_list = []
    text = " ".join(paragraph)
    for n in re.finditer(regex_neg, text):
        if n.group():
            stem_list.append(n.group())
    return json.dumps(stem_list)

And finally, here is the general output that I'm getting:

neg_stems(["the king abdicated the throne in an argument where he was angry, but his son was pretty happy about it","I hate cats but love hedgehogs"])

> ["abdicat", "argument", "anger", "hate"]

I'm just trying to count the number of negative words as defined by the semantic dictionary in regex_neg, but ~300 hours is just way too long, and even then, that's simply an estimate.

Does anyone have a suggestion on what I could do to try and speed this process up?
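Edit: one thing I've found that may help before abandoning regex entirely: the 2800 capturing groups make the engine record a slot for every group on each match attempt. A flat alternation with no groups, escaped and sorted longest-first (so a longer stem wins when a shorter one is its prefix), compiled once, avoids that bookkeeping. A minimal sketch, where the short `stems` list is a hypothetical stand-in for the full ~2800-term dictionary:

```python
import json
import re

# Hypothetical stand-in for the full ~2800-term dictionary.
stems = ["abandon", "abdicat", "abhor", "argument", "hate"]

# One flat alternation: no capturing groups, terms escaped,
# sorted longest-first so longer stems take precedence.
regex_neg = re.compile(
    "|".join(sorted(map(re.escape, stems), key=len, reverse=True))
)

def neg_stems(paragraph):
    text = " ".join(paragraph)
    return json.dumps([m.group() for m in regex_neg.finditer(text)])
```

Calling `neg_stems(["he abdicated", "I hate cats"])` should still find `"abdicat"` and `"hate"` as before, just without the per-group overhead.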

Thank you in advance!

Phil Robinson
  • I can't post this as an answer, because the answer itself would be too broad, but one option to consider here would be to load your text(s) into a database which supports full-text search. Then, use FTS to find the stems in the text. Python is perhaps not as well suited for this type of operation as a database. – Tim Biegeleisen Mar 20 '19 at 13:59
  • Looks like a plain-text search to me; are you sure you need a regex? – Aaron Mar 20 '19 at 14:00
  • I'm not sure regex is well-suited to this problem. Regex is *great* as a tokenizer, but as far as matching stems, a dictionary of values (or a `set`) might be the best bet here – C.Nivs Mar 20 '19 at 14:03
  • [This thread](https://stackoverflow.com/questions/42742810/speed-up-millions-of-regex-replacements-in-python-3) is likely the answer. See [this post](https://stackoverflow.com/a/42789508/3832970). – Wiktor Stribiżew Mar 20 '19 at 14:04
  • Ok, thank you for the input guys! I really don't need to use it as a regex, so I'll go about with plain-text matching and see how that progresses instead! – Phil Robinson Mar 20 '19 at 14:10
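The set-based approach suggested in the comments can be sketched as follows: split the text into words and test each word's prefixes against a `set` of stems, so each lookup is a hash check instead of a pass through a 2800-branch alternation. The stem set here is a hypothetical stand-in for the full dictionary:

```python
import json

# Hypothetical stand-in for the full negative-stem dictionary.
neg_set = {"abandon", "abdicat", "argument", "hate"}
max_len = max(len(s) for s in neg_set)

def neg_stems(paragraph):
    found = []
    for word in " ".join(paragraph).lower().split():
        # Check each prefix of the word against the stem set:
        # O(word length) hash lookups per word.
        for k in range(1, min(len(word), max_len) + 1):
            if word[:k] in neg_set:
                found.append(word[:k])
                break
    return json.dumps(found)
```

With this sketch, `neg_stems(["the king abdicated the throne", "I hate cats"])` finds `"abdicat"` and `"hate"`; note it matches stems only at word starts, which is what stem matching usually wants anyway.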

0 Answers