I have code that walks a directory tree recursively, looking for words from a list. When a word is found, it prints the file it was found in, the string that was searched for, and the line it was found on.

My issue is that searching for 'api' also matches 'myapistring', 'pass' matches 'compass', and 'dev' matches 'device', instead of matching only the whole word. So I need to implement a regex somewhere, but I'm unsure where, and in which part of the for loop.

The regex I have that (I think) works is:

regex='([\w.]+)'

import os
import re

rootpath=myDir
wordlist=["api","pass","dev"]
exclude=["testfolder","testfolder2"]
complist=[]
files=[]

for word in wordlist:
    # flags belong in re.compile(); a compiled pattern's findall() takes pos, not flags
    complist.append(re.compile(word, re.IGNORECASE))

for path,name,fname in os.walk(rootpath):
    # prune excluded directories in place so os.walk skips them
    name[:] = [d for d in name if d not in exclude]
    for fileNum in fname:
        files.append(os.path.join(path, fileNum))

for fileLine in files:
    # exten (the list of file extensions to search) is defined elsewhere
    if any(ext in fileLine for ext in exten):
        count=0
        with open(fileLine, "r") as f:
            for line in f:
                count=count+1
                for lv in complist:
                    for mat in lv.findall(line):
                        [print output]

Thanks

EDIT: Added this code as provided:

for word in wordlist:
    complist.extend([re.compile(r'\b' + re.escape(word) + r'\b')])

This works, with a few errors, but well enough for me to work with.

Bob

1 Answer

Instead of:

for word in wordlist:
    complist.extend([re.compile(word)])

Use word boundaries:

for word in wordlist:
    complist.extend([re.compile(r'\b{}\b'.format(word))])

\b is a zero-width assertion that matches at the start or end of a word, so \bthe\b will match this line:

the lazy dog

But not this line:

then I checked StackOverflow
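
A quick way to check this behaviour is the sketch below. The variable names are made up for illustration. It also shows why the pattern is written as a raw string: in a normal string literal, Python reads \b as a backspace character before the regex engine ever sees it.

```python
import re

# Raw string: \b reaches the regex engine as a word-boundary assertion.
pattern = re.compile(r'\bthe\b')

matches_word = pattern.search('the lazy dog') is not None
matches_prefix = pattern.search('then I checked StackOverflow') is not None

# Without the r prefix, '\b' is a single backspace character (0x08),
# so this pattern looks for literal backspaces on both sides of "the".
backspace_pattern = re.compile('\bthe\b')
matches_without_raw = backspace_pattern.search('the lazy dog') is not None

print(matches_word, matches_prefix, matches_without_raw)
```

Only the first search should succeed: the second line contains "the" only as part of "then", and the third pattern is not looking for word boundaries at all.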

Another thing I want to point out is that if word contains any special characters that mean something to the regex engine, they'll be interpreted as part of the regex. So, instead of:

complist.extend([re.compile(r'\b{}\b'.format(word))])

Use:

complist.extend([re.compile(r'\b{}\b'.format(re.escape(word)))])
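
To see the difference escaping makes, here is a small sketch. The search term 'v1.0' and the sample lines are hypothetical, chosen only because the term contains a regex metacharacter.

```python
import re

word = 'v1.0'  # hypothetical search term; "." is a regex metacharacter

unescaped = re.compile(r'\b{}\b'.format(word))
escaped = re.compile(r'\b{}\b'.format(re.escape(word)))

# Unescaped, "." matches any character, so "v1x0" is a false positive.
unescaped_hit = unescaped.search('release v1x0 notes') is not None

# Escaped, only a literal "." is accepted.
escaped_hit = escaped.search('release v1x0 notes') is not None
escaped_real = escaped.search('release v1.0 notes') is not None

print(unescaped_hit, escaped_hit, escaped_real)
```

For plain alphabetic words such as api, pass and dev, re.escape changes nothing, so it is safe to apply to every entry in the word list.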

Edit: As stated in the comments, you also want to match words separated by _. _ is considered a "word character" by Python, so, to include it as a word separator, you can do this:

re.compile(r'(?:\b|_){}(?:\b|_)'.format(re.escape(word)))

See this work here (with word = 'admin'):

In [45]: regex = re.compile(r'(?:\b|_){}(?:\b|_)'.format(re.escape(word)))

In [46]: regex.search('this line contains is_admin')
Out[46]: <_sre.SRE_Match at 0x105bca3d8>

In [47]: regex.search('this line contains admin')
Out[47]: <_sre.SRE_Match at 0x105bca4a8>

In [48]: regex.search("does not have the word")

In [49]: regex.search("does not have the wordadminword")
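
Putting the pieces together, the whole search might look roughly like the sketch below. This is one possible arrangement under stated assumptions, not the asker's actual script: the function names compile_words and search_tree, the extensions parameter, and the choice to yield tuples instead of printing are all made up for illustration. The case-insensitive flag is passed to re.compile, because the second positional argument of a compiled pattern's search() and findall() is a start position, not a flags value.

```python
import os
import re


def compile_words(words, allow_underscore=True):
    """Compile one case-insensitive whole-word pattern per search term."""
    # Optionally treat "_" as a separator too, so "admin" matches "is_admin".
    edge = r'(?:\b|_)' if allow_underscore else r'\b'
    return [(w, re.compile(edge + re.escape(w) + edge, re.IGNORECASE))
            for w in words]


def search_tree(rootpath, words, exclude=(), extensions=None):
    """Yield (file path, word, line number) for every whole-word hit."""
    patterns = compile_words(words)
    for path, dirs, fnames in os.walk(rootpath):
        # Prune excluded directories in place so os.walk skips them.
        dirs[:] = [d for d in dirs if d not in exclude]
        for fname in fnames:
            if extensions and not fname.endswith(tuple(extensions)):
                continue
            full = os.path.join(path, fname)
            with open(full, 'r', errors='ignore') as handle:
                for lineno, line in enumerate(handle, start=1):
                    for word, pattern in patterns:
                        if pattern.search(line):
                            yield full, word, lineno
```

A call such as search_tree('.', ['api', 'pass', 'dev'], exclude=['testfolder', 'testfolder2'], extensions=['.py']) could then be looped over to print each hit in whatever format is needed.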
Will
  • That gives some weird results. It apparently matches: String '('u00b6', 'w')' found at line number 30, but u00b6 isn't in my wordlist. It doesn't find words that are in the list, even though I know they are there, because re.compile(word) finds them – Bob Feb 02 '16 at 11:09
  • Sorry, try my edit! We needed `r'raw strings'` to keep python from interpreting `\b`. – Will Feb 02 '16 at 11:19
  • Thanks, that works, but it misses something I was expecting. Will complist.extend([re.compile(r'\b{}\b'.format(re.escape(word)))]) find is_admin if 'admin' is in the word list? Currently it doesn't, I'm guessing due to the underscore? – Bob Feb 02 '16 at 11:34
  • It will find `is-admin` but not `is_admin`, because `_` is considered a "word character". You could try something like `r'(?:\b|_){}(?:\b|_)'.format(re.escape(word))` or `r'\b_?{}_?\b'.format(re.escape(word))`. – Will Feb 02 '16 at 11:44
  • No problem, glad I could help! :) – Will Feb 02 '16 at 12:18