findall() regex when iterating through files looking for word from list

Question

I have code that iterates through files recursively, looking for words from a list. When a word is found, it prints the file it was found in, the string that was searched for, and the line it was found on.

My issue is that searching for 'api' also matches 'myapistring', 'pass' matches 'compass', and 'dev' matches 'device', instead of only the actual word. So I need to implement a regex somewhere, but I'm unsure where, and in which part of the for loop.

The regex I have that (I think) works is:

regex='([\w.]+)'

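import os
import re

# myDir (the root directory) and exten (the file extensions to scan) are defined elsewhere
rootpath=myDir
wordlist=["api","pass","dev"]
exclude=["testfolder","testfolder2"]
complist=[]
files=[]

for word in wordlist:
    # flags go in re.compile(); the second argument of a compiled pattern's findall() is a start position
    complist.extend([re.compile(word, re.IGNORECASE)])

for path,name,fname in os.walk(rootpath):
    # prune excluded directories in place so os.walk() does not descend into them
    name[:] = [d for d in name if d not in exclude]
    for fileNum in fname:
        i=path+"/"+fileNum
        files.append(i)

for fileLine in files:
    if any(ext in fileLine for ext in exten):
        count=0
        for line in open(fileLine, "r").readlines():
            count=count+1
            for lv in complist:
                match = lv.findall(line)

                for mat in match:
                    print(fileLine, mat, count)  # file, matched string, line number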

Thanks

EDIT: Added this code as provided:

for word in wordlist:
    complist.extend([re.compile(r'\b' + re.escape(word) + r'\b')])

This works, with a few errors, but well enough that I can work with it.

http://stackoverflow.com/questions/15863066/python-regular-expression-match-whole-word – Harpreet Singh Feb 02 '16 at 10:51
Thanks, but that doesn't tell me where to put the regex so that it only finds the whole word in the line rather than any instance of the word. – Bob Feb 02 '16 at 10:53
I don't know Python, but my guess is: right after the line `for line in open(fileLine, "r").readlines():`, test each line with something like `re.search(r'\bis\b', line)`. – Harpreet Singh Feb 02 '16 at 10:56

Will · Accepted Answer · 2016-02-02T11:53:37.740

Instead of:

for word in wordlist:
    complist.extend([re.compile(word)])

Use word boundaries:

for word in wordlist:
    complist.extend([re.compile(r'\b{}\b'.format(word))])

\b is a zero-width assertion that matches at the start or end of a word (it consumes no characters), so \bthe\b will match this line:

the lazy dog

But not this line:

then I checked StackOverflow
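The two sample lines above can be checked directly. This is only a minimal sketch using the standard `re` module; the variable names are made up for illustration.

```python
import re

# Raw string, so that \b reaches the regex engine as a word boundary
# (in a normal string literal, '\b' is a backspace character).
pattern = re.compile(r'\bthe\b')

# "the" stands alone here, so both boundaries are satisfied.
matches_word = pattern.search('the lazy dog') is not None

# Here "the" is followed by "n", so the trailing \b fails.
matches_prefix = pattern.search('then I checked StackOverflow') is not None

print(matches_word)    # True
print(matches_prefix)  # False
```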

One more thing: if word contains any characters that have a special meaning to the regex engine, they will be interpreted as part of the regex rather than matched literally. So, instead of:

complist.extend([re.compile(r'\b{}\b'.format(word))])

Use:

complist.extend([re.compile(r'\b{}\b'.format(re.escape(word)))])
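To see why the escaping matters, here is a small sketch. The search term 'a.b' and the sample lines are hypothetical, chosen only because '.' is a regex metacharacter.

```python
import re

word = 'a.b'  # hypothetical search term containing a metacharacter

unescaped = re.compile(r'\b{}\b'.format(word))
escaped = re.compile(r'\b{}\b'.format(re.escape(word)))

line = 'value axb here'

# Without escaping, '.' matches any character, so 'axb' is a (wrong) hit.
unescaped_hit = unescaped.search(line) is not None

# With escaping, '.' only matches a literal dot, so 'axb' is not a hit.
escaped_hit = escaped.search(line) is not None

# The literal term is still found.
literal_hit = escaped.search('value a.b here') is not None

print(unescaped_hit, escaped_hit, literal_hit)  # True False True
```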

Edit: As stated in the comments, you also want to match words separated by an underscore. Python treats _ as a "word character", so \b does not match next to it. To accept _ as a word separator as well, you can do this:

re.compile(r'(?:\b|_){}(?:\b|_)'.format(re.escape(word)))

Here it is in action, with word set to 'admin':

In [45]: regex = re.compile(r'(?:\b|_){}(?:\b|_)'.format(re.escape(word)))

In [46]: regex.search('this line contains is_admin')
Out[46]: <_sre.SRE_Match at 0x105bca3d8>

In [47]: regex.search('this line contains admin')
Out[47]: <_sre.SRE_Match at 0x105bca4a8>

In [48]: regex.search("does not have the word")

In [49]: regex.search("does not have the wordadminword")
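One side effect of the pattern above is that a matched underscore becomes part of the match, so findall() on 'is_admin' returns '_admin'. As an alternative (not what this answer uses), lookarounds can reject neighbouring letters and digits without consuming anything. The sketch below is only an illustration under that assumption: compile_words, find_words, and the sample lines are hypothetical names and data. It also passes re.IGNORECASE to re.compile(), because the second positional argument of a compiled pattern's findall() is a start position, not a flags value.

```python
import re


def compile_words(words):
    # [^\W_] means "a word character other than underscore", i.e. a letter
    # or digit. The lookarounds require that no such character sits directly
    # before or after the term, so '_' counts as a separator.
    return [
        re.compile(r'(?<![^\W_]){}(?![^\W_])'.format(re.escape(w)), re.IGNORECASE)
        for w in words
    ]


def find_words(lines, patterns):
    # Yield (line_number, matched_text) for every hit; lines are numbered from 1.
    for number, line in enumerate(lines, start=1):
        for pattern in patterns:
            for hit in pattern.findall(line):
                yield number, hit


sample = [
    'api_key = load()',
    'myapistring = 1',
    'the compass points to Dev',
    'is_admin and device',
]
patterns = compile_words(['api', 'pass', 'dev', 'admin'])
hits = list(find_words(sample, patterns))
print(hits)  # [(1, 'api'), (3, 'Dev'), (4, 'admin')]
```

In the question's loop, the same patterns could be appended to complist in place of re.compile(word), and lv.findall(line) called without a second argument.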

edited Feb 02 '16 at 11:53

answered Feb 02 '16 at 10:59


That gives some weird results. It apparently matches: String '('u00b6', 'w')' found at line number 30, but u00b6 isn't in my wordlist. It also doesn't find words that are in the list, even though I know they are there because re.compile(word) finds them. – Bob Feb 02 '16 at 11:09
Sorry, try my edit! We needed `r'raw strings'` to keep python from interpreting `\b`. – Will Feb 02 '16 at 11:19
Thanks, that works, but it misses something I was expecting. Will complist.extend([re.compile(r'\b{}\b'.format(re.escape(word)))]) find is_admin if 'admin' is in the word list? Currently it doesn't, I'm guessing due to the underscore? – Bob Feb 02 '16 at 11:34
It will find `is-admin` but not `is_admin`, because `_` is considered a "word character". You could try something like `r'(?:\b|_){}(?:\b|_)'.format(re.escape(word))` or `r'\b_?{}_?\b'.format(re.escape(word))`. – Will Feb 02 '16 at 11:44
No problem, glad I could help! :) – Will Feb 02 '16 at 12:18
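The raw-string point raised in the comments above can be verified with a short sketch. The variable names and the sample line are illustrative only.

```python
import re

plain = '\b'   # a single backspace character (code point 8)
raw = r'\b'    # two characters: a backslash followed by 'b'

# The regex engine only sees a word boundary in the raw version;
# the plain version asks for literal backspace characters around the word.
plain_pattern = re.compile(plain + 'admin' + plain)
raw_pattern = re.compile(raw + 'admin' + raw)

line = 'this line contains admin'
plain_hit = plain_pattern.search(line) is not None
raw_hit = raw_pattern.search(line) is not None

print(len(plain), len(raw))  # 1 2
print(plain_hit, raw_hit)    # False True
```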
