I have the following regex that I've been told to edit for school (I've replaced the specific terms with generic ones to keep it simple). To my understanding, it says a document matches our desired theme if it finds any of the following:
- one of the PRIMARYs,
- one of the SECONDARYs followed by a TERTIARY within 20 words, or
- one of the TERTIARYs followed by a SECONDARY within 20 words.
We have a Python script that runs this regex over multiple articles and reports a Precision and a Recall value by comparing the computer-calculated matches against the hand-coded matches for the theme we're trying to detect. It also generates HTML files showing me the error hits (computer yes / human no) and the missed hits (computer no / human yes).
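For reference, this is roughly how I understand the two numbers to be computed. It is a minimal sketch, not the actual script: the function name `precision_recall` and the sample data are made up for illustration.

```python
def precision_recall(pairs):
    """pairs: iterable of (computer_match, human_match) booleans, one per article."""
    tp = fp = fn = 0
    for computer, human in pairs:
        if computer and human:
            tp += 1          # correct hit
        elif computer and not human:
            fp += 1          # "error" hit: computer yes / human no
        elif human and not computer:
            fn += 1          # "missed" hit: computer no / human yes
    # Guard against division by zero when there are no hits at all.
    precision = tp / float(tp + fp) if (tp + fp) else 0.0
    recall = tp / float(tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical coding results for five articles.
coded = [(True, True), (True, False), (False, True), (True, True), (False, False)]
print(precision_recall(coded))
```

So a low Precision with a high Recall means the regex catches almost every hand-coded article but also flags many that the human coders rejected.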
I'm at the point where I have .99 Recall but only about .47 Precision. I know some of the errors come from articles that contain one of the listed words in a context unrelated to what we're trying to detect, and those contexts often correlate with BADWORDS. I'd like the regex to count as a match only if there are no BADWORDS, or at least no BADWORDS within 20 words of the other terms. How would I go about editing the regex to do that?
\b(PRIMARY1|PRIMARY2|PRIMARY3...)\b| \b(SECONDARY1|SECONDARY2|SECONDARY3...)\b\s+(\S+\s+){0,20}\b(TERTIARY1|TERTIARY2|TERTIARY3...)\b |\b(TERTIARY1|TERTIARY2|TERTIARY3...)\b\s+(\S+\s+){0,20}\b(SECONDARY1|SECONDARY2|SECONDARY3...)\b
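To make the current behaviour concrete, here is a minimal sketch of the pattern using Python's `re` module. The short word lists are placeholders standing in for the elided real terms, the names `THEME`, `GAP` and `is_match` are made up for illustration, and I'm assuming the spaces around the `|` in the pasted regex are paste artifacts rather than part of the pattern.

```python
import re

# Placeholder word lists; the real lists are longer.
PRIMARY = r"\b(?:PRIMARY1|PRIMARY2|PRIMARY3)\b"
SECONDARY = r"\b(?:SECONDARY1|SECONDARY2|SECONDARY3)\b"
TERTIARY = r"\b(?:TERTIARY1|TERTIARY2|TERTIARY3)\b"

# Whitespace, then at most 20 intervening words.
GAP = r"\s+(?:\S+\s+){0,20}"

# Alternation binds loosest, so this reads as:
# PRIMARY  or  SECONDARY..TERTIARY  or  TERTIARY..SECONDARY
THEME = re.compile(
    PRIMARY
    + "|" + SECONDARY + GAP + TERTIARY
    + "|" + TERTIARY + GAP + SECONDARY
)


def is_match(text):
    """True if the text matches the theme pattern anywhere."""
    return THEME.search(text) is not None


print(is_match("some text PRIMARY2 more text"))          # True
print(is_match("SECONDARY1 one two three TERTIARY3"))    # True
print(is_match("SECONDARY1 with no partner nearby"))     # False
print(is_match("PRIMARY1 right next to BADWORD1"))       # True, the kind of error hit I want to remove
```

The last line shows the problem: nothing in the pattern looks at BADWORDS, so they have no effect on the result.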
Yes, a friend has told me that regex really isn't the best way to do this, but it's the toolset my project lead has told me to use, so it's what I have to work with.
EDIT: I did do my Google homework and ran into this pattern:
^((?!BADWORD1|BADWORD2|BADWORD3).)*$
I tried putting it both before and after the rest of the regex, but neither worked.
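My guess at why that fails: the pattern above is meant to match an entire string that contains no BADWORD, so gluing it onto another pattern leaves nothing for the other half to match. Below is a sketch of two alternatives I'm considering, under assumptions: the word lists are placeholders, the names `match_without_badwords` and `match_without_nearby_badwords` are made up, and I don't know whether the script's `score_text` applies patterns in a way that would accept either form. The first uses a negative lookahead anchored at the start of the document; the second does the 20-word proximity check in Python rather than in the regex, since Python's `re` does not support variable-length lookbehind.

```python
import re

# Placeholder word lists standing in for the real ones.
BAD = r"\b(?:BADWORD1|BADWORD2|BADWORD3)\b"
THEME = (
    r"\b(?:PRIMARY1|PRIMARY2)\b"
    r"|\b(?:SECONDARY1|SECONDARY2)\b\s+(?:\S+\s+){0,20}\b(?:TERTIARY1|TERTIARY2)\b"
    r"|\b(?:TERTIARY1|TERTIARY2)\b\s+(?:\S+\s+){0,20}\b(?:SECONDARY1|SECONDARY2)\b"
)

# Option 1: reject the whole document if a BADWORD appears anywhere.
# (?s) lets "." cross newlines; \A pins the lookahead to the document start.
NO_BAD_ANYWHERE = re.compile(r"(?s)\A(?!.*" + BAD + r").*?(?:" + THEME + r")")


def match_without_badwords(text):
    return NO_BAD_ANYWHERE.search(text) is not None


# Option 2: accept a theme hit only if no BADWORD sits inside the hit
# or within `window` words on either side of it.
def match_without_nearby_badwords(text, window=20):
    bad = re.compile(BAD)
    for m in re.finditer(THEME, text):
        before = text[:m.start()].split()[-window:]
        after = text[m.end():].split()[:window]
        context = " ".join(before + [m.group(0)] + after)
        if not bad.search(context):
            return True      # at least one clean hit
    return False


print(match_without_badwords("intro PRIMARY1 outro"))
print(match_without_badwords("intro PRIMARY1 outro BADWORD2"))
print(match_without_nearby_badwords("BADWORD1 next to PRIMARY1"))
```

Option 1 is stricter and will probably cost some Recall; Option 2 only discards hits whose surrounding context contains a BADWORD.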
EDIT2: Here is the relevant part of the script:
import csv

# normalize, score_text, test_logic_policy, highlight, regexes,
# INPUT_FILE and CELL_MAX are defined elsewhere in the script.
with open(INPUT_FILE,'rU') as f:
    count=0
    total=0
    misses=[]
    errors=[]
    for r in csv.DictReader(f):
        ### if r['ecig']=='':
        ###     continue
        count+=1
        text = normalize(r['ArticleContent'])
        result='Not relevant'
        feats = score_text(text, regexes)
        kw_coded = test_logic_policy(feats)       # computer coding: 1 or 0
        hu_coded = (r['theme_code']) #or int(r['Commercial'])
        feats2 = dict([('kw_%s' % k , v ) for k,v in feats.items()])
        if kw_coded==1 and hu_coded=='YES':
            total+=1
            result = 'Correct'
        elif hu_coded=='YES' and kw_coded==0:
            # missed hit: computer no / human yes
            result = 'Miss'
            highlighted_text = highlight(text, regexes)
            feats2.update({'tagged_text': highlighted_text[:CELL_MAX]})
            r.update(feats2)
            misses.append(r)
        elif kw_coded==1 and hu_coded=='NO':
            # error hit: computer yes / human no
            result = 'Error'
            highlighted_text = highlight(text, regexes)
            feats2.update({'tagged_text': highlighted_text[:CELL_MAX]})
            r.update(feats2)
            r['ArticleContent']=r['ArticleContent'][:CELL_MAX]
            errors.append(r)
        print 'Processing', r['ArticleID'], hu_coded, kw_coded, result
print '\nDONE\n\n'
# output misses and errors csv files
MISSES_FILE = INPUT_FILE.replace('.csv','_MISSES.csv')
ERRORS_FILE = INPUT_FILE.replace('.csv','_ERRORS.csv')