
Based on the code below (simplified for this post), can someone show me how to load a list of regex patterns (if a list is the right type to use) from a text file and match them against a single string?

There are many examples of loading text strings from a file and matching them against one regex pattern, but not the other way around: many regex patterns against one text string.

As the code shows, if I create the list manually and call re.compile on each pattern, I can match the patterns against the string. But where does re.compile fit in when the patterns are loaded from a file?

import regex as re

fname = 'regex_strings_short.txt'

string_to_match = 'onload=alert'

# Create a manual list of regexes
manual_regexes = [
    re.compile(r'(?i)\bHP\b(?:[^.,;]{1,20}?)\bnumber\b'),
    re.compile(r'(?i)\bgmail\b(?:[^.,;]{1,20}?)\bnumber\b'),
    re.compile(r'(?i)\bearthlink\b(?:[^.,;]{1,20}?)\bnumber\b'),
    re.compile(r'(?i)onload=alert')
]

# Create a text file with these five example patterns
'''
(?i)\bHP\b(?:[^.,;]{1,20}?)\bnumber\b
(?i)\bgmail\b(?:[^.,;]{1,20}?)\bnumber\b
(?i)\bearthlink\b(?:[^.,;]{1,20}?)\bnumber\b
(?i)onload=alert
(?i)hello
'''

# Import a list of regex patterns from the created file
with open(fname, 'r') as file:
    imported_regexes = file.readlines()

# Notice the difference in the formatting of the manual list with 'regex.Regex' and 'flags=regex.I | regex.V0' wrapping each item
print(manual_regexes)
print('---')
print(imported_regexes)

# A match is found in the manual list, but no match found in the imported list
if re.match(imported_regexes[3], string_to_match):
    print('Match found in imported_regexes.')
else:
    print('No match in imported_regexes.')

print('---')

if re.match(manual_regexes[3], string_to_match):
    print('Match found in manual_regexes.')
else:
    print('No match in manual_regexes.')

There is no match for imported_regexes but there is for manual_regexes.

UPDATE: The code below is the final solution that worked for me. Posting it as it may help someone landing here and needing a solution.

# Use the third-party regex module (import regex as re), not the standard-library re,
# because the standard-library re module does not support \p{...} property classes

import regex as re

# Add the post/string to match below
my_string = '<p>HP Support number</p>'

fname = 'regex_strings.txt'

# Contents of the text file are similar to the lines below,
# but without the leading '# ' (it is only there because these are inline comments)
# (?i)\bHP\b(?:[^.,;]{1,20}?)\bnumber\b
# (?i)\bgmail\b(?:[^.,;]{1,20}?)\bnumber\b
# (?i)】\b(?:[^.,;]{1,1000}?)\p{Lo}

# Import a list of regex patterns from a file; splitlines() drops the trailing newlines
with open(fname, 'r', encoding="utf8") as f:
    loaded_patterns = f.read().splitlines()

# print(loaded_patterns)
print(len(loaded_patterns))

found = False
for pattern in loaded_patterns:
    # search() looks anywhere in the string, unlike match(), which only tests the start
    if re.search(pattern, my_string):
        print('Match found. ' + pattern)
        found = True

if not found:
    print('No matching regex found.')
MDR

1 Answer


re.match accepts either a pattern string or a compiled regex as its first argument, and converts strings into compiled regex objects internally. You can call re.compile yourself as an optimization (when the same regex is used several times), but it is purely optional as far as program correctness goes.
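
To illustrate that point, here is a minimal sketch showing that a plain pattern string and a precompiled pattern give the same result. It assumes the standard-library re module so that it runs without extra packages; for this pattern, `import regex as re` should behave the same way. The variable names are only for illustration.

```python
import re

string_to_match = 'onload=alert'
pattern_text = r'(?i)onload=alert'

# Passing the pattern as a plain string works: it is compiled internally.
match_from_string = re.match(pattern_text, string_to_match)

# Compiling first is optional; it mainly helps when the pattern is reused often.
compiled = re.compile(pattern_text)
match_from_compiled = compiled.match(string_to_match)

print(match_from_string is not None, match_from_compiled is not None)
```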

If the program does not report a match for the imported regexes, it is because readlines() keeps the trailing '\n' on each line. Thus re.match('(?i)onload=alert\n', string_to_match) returns None: the pattern requires a newline that the string to match does not contain. Strip the newline first; after that you can call re.compile on the sanitized strings, or not.

with open(fname, 'r') as file:
    imported_regexes = file.readlines()
print(re.match(imported_regexes[3].strip('\n'), string_to_match))

This prints a match object.
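
Putting the pieces together, one possible way to load every pattern from the file, compile it once, and test all of them against a single string is sketched below. The helper name `load_patterns` and the temporary pattern file are hypothetical, added only so the example is self-contained. It uses the standard-library re module as an assumption; swap in `import regex as re` if your patterns need features such as \p{...}.

```python
import os
import re
import tempfile


def load_patterns(path):
    """Read one regex per line, dropping line endings and blank lines."""
    with open(path, 'r', encoding='utf8') as f:
        return [re.compile(line) for line in f.read().splitlines() if line]


# Write a throwaway pattern file so the sketch does not depend on an existing one.
sample = '(?i)\\bHP\\b(?:[^.,;]{1,20}?)\\bnumber\\b\n(?i)onload=alert\n(?i)hello\n'
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False,
                                 encoding='utf8') as tmp:
    tmp.write(sample)
    tmp_path = tmp.name

try:
    patterns = load_patterns(tmp_path)
finally:
    os.remove(tmp_path)

string_to_match = 'onload=alert'

# Collect the source text of every pattern that matches somewhere in the string.
matching = [p.pattern for p in patterns if p.search(string_to_match)]
print(matching)
```

Note that this sketch uses search() rather than match(): match() only succeeds at the start of the string, so a pattern such as the HP one would never match text like '<p>HP Support number</p>' with it.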

Diane M