
So I'm trying to match a regular expression against paragraphs in order to do sentiment analysis, but tqdm estimates this could take about 300 hours. I was wondering if anyone has a critique of what I could do to improve the way my RE runs.

I'm trying to match the stem endings of negative words for this analysis. Here is a small snippet of the expression used for the match. I'm only showing a small snippet because the entire expression contains about 2800 terms and is structured the same way all the way through, hence the ellipsis.

regex_neg = "((a lie)|(abandon)|(abas)|(abattoir)|(abdicat)|(aberra)|(abhor)|(abject)|(abnormal)|(abolish)|(abominab)|(abominat)|(abrasiv)|(absent)|(abstrus)|(absurd)|(abus)|(accident)|(accost)|(accursed)|(accusation)|(accuse)|(accusing)|(acerbi)|(ache)|(aching)|(achy)|(acomia)|(acrimon)|(adactylism)|(addict)|(admonish)|(admonition)|(adulterat)|(adultery)|(advers)|(affectation)|(affected)|(affected manner)|(afflict)|(affright)...)"

Here is the function that I'm using to match the stems in the paragraphs:

import json
import re

def neg_stems(paragraph):
    """Collect every negative stem matched in the joined text."""
    stem_list = []
    text = " ".join(paragraph)
    for n in re.finditer(regex_neg, text):
        if n.group():
            stem_list.append(n.group())
    return json.dumps(stem_list)

And finally, here is the general output that I'm getting:

neg_stems(["the king abdicated the throne in an argument where he was angry, but his son was pretty happy about it","I hate cats but love hedgehogs"])

> ["abdicat", "argument", "anger", "hate"]

I'm just trying to count the number of negative words as defined by the semantic dictionary in regex_neg, but ~300 hours is just way too long, and even then, that's simply an estimate.

Does anyone have a suggestion on what I could do to try and speed this process up?
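Edit: one thing I've found that may help before abandoning regex entirely: the 2800 capturing groups make the engine record a slot for every group on each match attempt. A flat alternation with no groups, escaped and sorted longest-first (so a longer stem wins when a shorter one is its prefix), compiled once, avoids that bookkeeping. A minimal sketch, where the short `stems` list is a hypothetical stand-in for the full ~2800-term dictionary:

```python
import json
import re

# Hypothetical stand-in for the full ~2800-term dictionary.
stems = ["abandon", "abdicat", "abhor", "argument", "hate"]

# One flat alternation: no capturing groups, terms escaped,
# sorted longest-first so longer stems take precedence.
regex_neg = re.compile(
    "|".join(sorted(map(re.escape, stems), key=len, reverse=True))
)

def neg_stems(paragraph):
    text = " ".join(paragraph)
    return json.dumps([m.group() for m in regex_neg.finditer(text)])
```

Calling `neg_stems(["he abdicated", "I hate cats"])` should still find `"abdicat"` and `"hate"` as before, just without the per-group overhead.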

Thank you in advance!

Phil Robinson
  • I can't post this as an answer, because the answer itself would be too broad, but one option to consider here would be to load your text(s) into a database which supports full-text search. Then, use FTS to find the stems in the text. Python is perhaps not as well suited for this type of operation as a database. – Tim Biegeleisen Mar 20 '19 at 13:59
  • Looks like a plain-text search to me; are you sure you need a regex? – Aaron Mar 20 '19 at 14:00
  • I'm not sure regex is well-suited to this problem. Regex is *great* as a tokenizer, but as far as matching stems, a dictionary of values (or a `set`) might be the best bet here – C.Nivs Mar 20 '19 at 14:03
  • [This thread](https://stackoverflow.com/questions/42742810/speed-up-millions-of-regex-replacements-in-python-3) is likely the answer. See [this post](https://stackoverflow.com/a/42789508/3832970). – Wiktor Stribiżew Mar 20 '19 at 14:04
  • Ok, thank you for the input guys! I really don't need to use it as a regex, so I'll go about with plain-text matching and see how that progresses instead! – Phil Robinson Mar 20 '19 at 14:10
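The set-based approach suggested in the comments can be sketched as follows: split the text into words and test each word's prefixes against a `set` of stems, so each lookup is a hash check instead of a pass through a 2800-branch alternation. The stem set here is a hypothetical stand-in for the full dictionary:

```python
import json

# Hypothetical stand-in for the full negative-stem dictionary.
neg_set = {"abandon", "abdicat", "argument", "hate"}
max_len = max(len(s) for s in neg_set)

def neg_stems(paragraph):
    found = []
    for word in " ".join(paragraph).lower().split():
        # Check each prefix of the word against the stem set:
        # O(word length) hash lookups per word.
        for k in range(1, min(len(word), max_len) + 1):
            if word[:k] in neg_set:
                found.append(word[:k])
                break
    return json.dumps(found)
```

With this sketch, `neg_stems(["the king abdicated the throne", "I hate cats"])` finds `"abdicat"` and `"hate"`; note it matches stems only at word starts, which is what stem matching usually wants anyway.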

0 Answers