So I'm trying to match a regular expression against paragraphs in order to do sentiment analysis, but tqdm is estimating this could take about 300 hours. I was wondering if anyone has a critique of what I could do to improve how my regex performs.
I'm matching the stems of negative words for this analysis. Here is a small snippet of the expression I'm matching against; I'm only showing a snippet because the full expression contains about 2800 terms and is structured the same way all the way through, hence the ellipsis.
regex_neg = "(a lie)|(abandon)|(abas)|(abattoir)|(abdicat)|(aberra)|(abhor)|(abject)|(abnormal)|(abolish)|(abominab)|(abominat)|(abrasiv)|(absent)|(abstrus)|(absurd)|(abus)|(accident)|(accost)|(accursed)|(accusation)|(accuse)|(accusing)|(acerbi)|(ache)|(aching)|(achy)|(acomia)|(acrimon)|(adactylism)|(addict)|(admonish)|(admonition)|(adulterat)|(adultery)|(advers)|(affectation)|(affected)|(affected manner)|(afflict)|(affright)..."
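To give a sense of its structure, here is roughly how a pattern like this would be assembled from a stem list (the list below is only a tiny placeholder standing in for the ~2800 real entries):

import re

# Placeholder only: the real stem list has ~2800 entries from the
# semantic dictionary.
neg_stem_list = ["a lie", "abandon", "abas", "abdicat", "abhor", "absurd"]

# Wrap each stem in its own group and join them into one big alternation.
regex_neg = "|".join("(" + re.escape(stem) + ")" for stem in neg_stem_list)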
Here is the function that I'm using to match the stems in the paragraphs
import re
import json

def neg_stems(paragraph):
    stem_list = []
    # Join the list of paragraphs into one string so the regex is run
    # over all of the text in a single pass.
    text = " ".join(paragraph)
    for n in re.finditer(regex_neg, text):
        if n.group():
            stem_list.append(n.group())
    return json.dumps(stem_list)
And finally, here is just the general output that I'm getting
neg_stems(["the king abdicated the throne in an argument where he was angry, but his son was pretty happy about it","I hate cats but love hedgehogs"])
> ["abdicat", "argument", "anger", "hate"]
I'm just trying to count the number of negative words as defined by the semantic dictionary in regex_neg, but ~300 hours is just way too long, and even then that's only an estimate.
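To be concrete, the only thing I need from the output is the length of the matched list, along these lines:

import json

# All I really need is the number of matched negative stems.
negative_count = len(json.loads(neg_stems(["I hate cats but love hedgehogs"])))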
Does anyone have a suggestion on what I could do to try and speed this process up?
Thank you in advance!