
I have the following regex that I've been told to edit for school (I've replaced the specific terms with generic ones to keep it simple). To my understanding, it looks for one of the PRIMARYs, then either

  1. One of the SECONDARYs followed by a TERTIARY within 20 words, or
  2. One of the TERTIARYs followed by a SECONDARY within 20 words,

and says the document is a match to our desired theme if it succeeds.

We have a Python script that analyzes multiple articles with this regex, then gives us Precision and Recall values by comparing the computer-calculated matches with the hand-coded matches for the theme we're trying to detect. It also generates HTML files showing me the error hits (computer yes/human no) and the missed hits (computer no/human yes).

I'm at the point where I have .99 Recall but only about .47 Precision. I know there are cases where we get ERRORs because an article contains one of the listed words in a context unrelated to what we're trying to detect, and those cases often correlate with BADWORDS. I'd like to tell the regex to count a match only if there are no BADWORDS, or at least no BADWORDS within 20 words of the other terms. How would I go about editing the regex to do that?

\b(PRIMARY1|PRIMARY2|PRIMARY3...)\b| \b(SECONDARY1|SECONDARY2|SECONDARY3...)\b\s+(\S+\s+){0,20}\b(TERTIARY1|TERTIARY2|TERTIARY3...)\b |\b(TERTIARY1|TERTIARY2|TERTIARY3...)\b\s+(\S+\s+){0,20}\b(SECONDARY1|SECONDARY2|SECONDARY3...)\b

Yes, a friend has told me that regex really isn't the best way to do this but it's the toolset I've been told to use by my project lead so it's what I have to work with.

EDIT: I did do my Google homework and ran into ^((?!BADWORD1|BADWORD2|BADWORD3).)*$ and tried putting it both before everything and after everything but neither worked.

EDIT2:

with open(INPUT_FILE,'rU') as f:
    count=0
    total=0
    misses=[]
    errors=[]
    for r in csv.DictReader(f):
       ### if r['ecig']=='':
           ### continue

        count+=1
        text = normalize(r['ArticleContent'])
        result='Not relevant'
        feats = score_text(text, regexes)
        kw_coded = test_logic_policy(feats)
        hu_coded = (r['theme_code']) #or int(r['Commercial'])

        feats2 = dict([('kw_%s' % k , v ) for k,v in feats.items()])

        if kw_coded==1 and hu_coded=='YES':
            total+=1
            result = 'Correct'

        elif hu_coded=='YES' and kw_coded==0:
            result = 'Miss'
            highlighted_text = highlight(text, regexes)
            feats2.update({'tagged_text': highlighted_text[:CELL_MAX]})
            r.update(feats2)
            misses.append(r)

        elif kw_coded==1 and hu_coded=='NO':
            result = 'Error'
            highlighted_text = highlight(text, regexes)
            feats2.update({'tagged_text': highlighted_text[:CELL_MAX]})
            r.update(feats2)
            r['ArticleContent']=r['ArticleContent'][:CELL_MAX]

            errors.append(r)




        print 'Processing', r['ArticleID'], hu_coded, kw_coded, result


print '\nDONE\n\n'
# output misses and errors csv files
MISSES_FILE = INPUT_FILE.replace('.csv','_MISSES.csv')
ERRORS_FILE = INPUT_FILE.replace('.csv','_ERRORS.csv')
Mariano
D. K.
  • Also, please use code formatting for the expressions. It's easier to read. I have already edited your question, but you should present clear questions to get better help. – Mariano Sep 23 '15 at 02:51
  • @Mariano so where would I put that in relation to what I have above? – D. K. Sep 23 '15 at 02:56
  • Please let me know if you're using `re` or `regex` (or show the part of your code where the match is done). There is a different approach for each. – Mariano Sep 23 '15 at 02:58
  • @Mariano The CSV file which contains the regex says regex for the column header, so I assume regex? – D. K. Sep 23 '15 at 03:58
  • Thanks for adding the code, though it's not the part where you attempt the regex call (probably `score_text()`). – Mariano Sep 24 '15 at 00:46

2 Answers


To make a regex that matches lines that don't contain some word, you need to use a negative lookahead.

The full regex (discussed in this SO post) is:

^((?!word).)*$

...where 'word' is the word you want to avoid.
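One thing worth noting: this pattern only succeeds when the whole string is free of the word, so adding it as one more `|` alternative, or pasting it before or after another pattern, does not turn it into a veto. A minimal sketch of using it as a separate whole-document check instead, assuming Python's built-in `re` module; the word lists are placeholders and `is_theme_match` is a hypothetical helper name, not part of the asker's script:

```python
import re

# Placeholder word lists; substitute the real terms.
# re.DOTALL lets "." cross newlines, so the check covers a multi-line article.
NO_BADWORD = re.compile(
    r"^(?:(?!\b(?:BADWORD1|BADWORD2)\b).)*$",
    re.DOTALL | re.IGNORECASE,
)
THEME = re.compile(r"\b(?:PRIMARY1|PRIMARY2)\b", re.IGNORECASE)


def is_theme_match(text):
    """True only if the theme pattern hits and no BADWORD appears anywhere."""
    return bool(NO_BADWORD.match(text)) and bool(THEME.search(text))
```

This rejects any document containing a BADWORD, however far it is from the other terms. A plain `not BAD.search(text)` check would express the same thing more directly.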

Community
alksdjg
  • I tried integrating this before with ^((?!BADWORD1|BADWORD2|BADWORD3).)*$ but that didn't work. – D. K. Sep 23 '15 at 02:47

The regex engine attempts a match starting at every character in the text until it succeeds. From the current position, you can first try to match a BADWORD within the next 20 words and, if one is there, consume another 20 words after it. This results in a successful match. However, because that branch contains no capturing group, your code can recognize the match as one to ignore.


This would be the idea:

\b(?:\w+\W+){0,20}(?:BADWORD1|BAD2)\b(?:\W+\w+){0,20}|...<rest of your pattern>

First, the regex engine will try to match a BADWORD. If it does, it returns a match with no captures, so you have to discard it and go for the next match (starting 20 words after the BADWORD). If it doesn't find it, then it can attempt to match the rest of the pattern (your regex).


Regex:

\b(?:\w+\W+){0,20}(?:BADWORD1|BAD2)\b(?:\W+\w+){0,20}|\b(PRIMARY1|P2|P3)\b|\b(SECONDARY1|S2|S3)\W+(?:\w+\W+){0,20}(TERTIARY1|T2|T3)\b|\b(TERTIARY1|T2|T3)\W+(?:\w+\W+){0,20}(SECONDARY1|S2|S3)\b

DEMO


EDIT:

As you can see, the above expression succeeds at matching a BADWORD, but it doesn't return a group(n) for that match. In your code, you can ignore this result when MatchObj.lastindex is None (that is, when no capturing group took part in the match).

Code:

import re


def search_words( regex, target):
    result = []
    for m in re.finditer( regex, target):
        #check if there is a capture
        if m.lastindex:
            result.append(m.group())

    return result


p = re.compile(r"""
        \b(?:\w+\W+){0,20}           # in the next 20 words
        (?:BADWORD1|BAD2)            # find a BADWORD
        \b(?:\W+\w+){0,20}           # and consume 20 more words avoiding a match there

        |\b(PRIMARY1|P2|P3)\b        # else, succeed if it matches a PRIMARY

        |\b(SECONDARY1|S2|S3)        # else, find a SECONDARY
           \W+(?:\w+\W+){0,20}       # nearly followed by
           (TERTIARY1|T2|T3)\b       # a TERTIARY

        |\b(TERTIARY1|T2|T3)         # or else, find a TERTIARY
           \W+(?:\w+\W+){0,20}       # nearly followed by
           (SECONDARY1|S2|S3)\b      # a SECONDARY
        """, re.IGNORECASE | re.VERBOSE)

text_list = [
             "PRIMARY1 word word SECONDARY1 word TERTIARY1",
             "word word word word word word word word word word word word word word word word word word word word word BADWORD1 word word word word word word word word word word PRIMARY1 . . . . . . . . . . . . . . . .",
             "word word word word word word word word word word word word word word word word word word word word word BADWORD1 word word word word word word word word word word word <MORE THAN 20 WORDS AWAY FROM BAD> word word word TERTIARY1 word word word word word word word word SECONDARY1. . . . . . . . . . . . . . . . . . . . . . . . . . . ."
             ]



for text in text_list:
    match_result = search_words( p, text)

    print("\nTEXT = %s" % text)
    print("MATCHES = %s" % (match_result if match_result else 'No'))

DEMO
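Since the script makes a yes/no decision per article, the same capture check has to be applied wherever the regex is used, including whatever does the highlighting. A minimal sketch of that filter, assuming the built-in `re` module; `real_matches` and `is_relevant` are hypothetical helper names, and the pattern is a shortened placeholder with the same shape as the one above (the internals of `score_text()` and `highlight()` are not shown in the question, so how they would call this is an assumption):

```python
import re

# Shortened placeholder pattern: a BADWORD window with no capturing group,
# followed by one "real" alternative that does capture.
PATTERN = re.compile(r"""
    \b(?:\w+\W+){0,20}(?:BADWORD1|BAD2)\b(?:\W+\w+){0,20}   # BADWORD window
    |\b(PRIMARY1|P2|P3)\b                                   # real hit
    """, re.IGNORECASE | re.VERBOSE)


def real_matches(pattern, text):
    """Yield only the matches in which a capturing group participated."""
    for m in pattern.finditer(text):
        if m.lastindex is not None:
            yield m


def is_relevant(pattern, text):
    """Document-level decision: 1 if at least one real match exists, else 0."""
    return 1 if next(real_matches(pattern, text), None) is not None else 0
```

If the highlighting code iterates over every match without this filter, the BADWORD windows will be highlighted as if they were hits, even though they should have been discarded.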


> You can't do the "ignore on no match" thing within the regex? I don't think there's anything in the Python for that.

There is a way, an ugly and inefficient way, to do it within a single match: a real backtracking hell. However, this will only return one match. If you want to capture more than one word in a text, it will fail:

Regex:

^(?:\W*(?:\w+\W+){0,20}\b(?:BADWORD1|BAD2)\b(?:\W+\w+){20}|(?!\W*(?:\w+\W+){0,20}\b(?:BADWORD1|BAD2)\b)[\s\S])*?(?:\b(PRIMARY1|P2|P3)\b|\b(SECONDARY1|S2|S3)\W+(?:\w+\W+){0,20}(TERTIARY1|T2|T3)\b|\b(TERTIARY1|T2|T3)\W+(?:\w+\W+){0,20}(SECONDARY1|S2|S3)\b)

Note: this expression could result in catastrophic backtracking

DEMO
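If the backtracking cost is a concern, the proximity rule can be moved out of the regex and into code. This is a different technique from the single expression above, not a rewrite of it: a rough sketch that finds BADWORDs and theme hits separately, then compares their word positions. It assumes the built-in `re` and `bisect` modules; `hits_clear_of_badwords` is a hypothetical name, and `THEME` is a placeholder standing in for the full PRIMARY/SECONDARY/TERTIARY pattern:

```python
import bisect
import re

BAD = re.compile(r"\b(?:BADWORD1|BAD2)\b", re.IGNORECASE)
THEME = re.compile(r"\b(?:PRIMARY1|P2|P3)\b", re.IGNORECASE)  # placeholder
WORD = re.compile(r"\w+")
WINDOW = 20  # a BADWORD this many words away or closer vetoes a hit


def hits_clear_of_badwords(text):
    """Return theme hits that have no BADWORD within WINDOW words."""
    word_starts = [m.start() for m in WORD.finditer(text)]

    def word_at(char_pos):
        # Index of the word that contains (or most recently precedes) char_pos.
        return bisect.bisect_right(word_starts, char_pos) - 1

    bad_words = [word_at(m.start()) for m in BAD.finditer(text)]
    kept = []
    for m in THEME.finditer(text):
        first, last = word_at(m.start()), word_at(m.end() - 1)
        if not any(first - WINDOW <= b <= last + WINDOW for b in bad_words):
            kept.append(m.group())
    return kept
```

Each pattern stays simple, and the 20-word rule becomes an ordinary integer comparison that is easy to adjust.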

Mariano
  • **Note:** If you were using the [regex package](https://pypi.python.org/pypi/regex), you could simply use the control verb `(*SKIP)` and thus avoid checking for captures. – Mariano Sep 23 '15 at 04:31
  • I tried putting what you put but seem to be getting the same results. Furthermore, when I look in the HTML file that lists all the ERROR articles(computer yes human no), it seems to list all the articles that should've been discarded because of a BADWORD with the BADWORDs highlighted like a match (we have the python script set so that the words that match with our regex i.e. were the reasons that the text was marked Yes are highlighted). – D. K. Sep 23 '15 at 17:04
  • Are you positive you're actually ignoring a match if it doesn't return a capture? Maybe you can show us the significant part of the python code with the regex call. Can you edit your question to provide examples where it fails? – Mariano Sep 23 '15 at 17:11
  • You can't do the "ignore on no match" thing within the regex? I don't think there's anything in the Python for that. I've pasted the bit of the Python script that executes all the functions in the original Q – D. K. Sep 23 '15 at 20:26
  • It can be done in a single match attempt, and I added the expression in the answer. However, there is no point in doing so. You should really edit the code to get the desired result. – Mariano Sep 23 '15 at 23:56