How to search tens of thousands of items in a list of lists using hundreds of patterns

Question

I'm looking for advice on a better (faster) way to approach this. My problem is that as you increase the length of the "hosts" list the program takes exponentially longer to complete, and if "hosts" is long enough it takes so long for the program to complete that it seems to just lock up.

"hosts" is a list of lists that contains tens of thousands of items. When iterating through "hosts", i[0] will always be an IP address, i[4] will always be a 5-digit number stored as a string, and i[7] will always be a multi-line string.
"searchPatterns" is a list of lists read in from a CSV file where elements i[0] through i[3] are regex search patterns (or the string "SKIP") and i[6] is a unique string used to identify a pattern match.

My current approach is to use the regex patterns from the CSV file to search through every multi-line string held in the i[7] element of "hosts". There are hundreds of possible matches, and I need to identify all matches associated with each IP address and record the unique string from the CSV file for each pattern that matched. Finally, I need to put that information into "fullMatchList" to use later.

NOTE: Even though each list item in "searchPatterns" has up to 4 patterns, I only need it to identify the first pattern found and then it can move on to the next list item to continue finding matches for that IP.

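import re

# hosts and searchPatterns are assumed to be loaded already, as described above
tempIP = ""            # IP address whose matches are currently being collected
matchListPerIP = []    # identifiers matched so far for tempIP
fullMatchList = []     # final result: [IP, [identifiers]] pairs

for i in hosts: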
    if i[4] == "13579" or i[4] == "24680":
        for j in searchPatterns:
            for k in range(4):
                if j[k] == "SKIP":
                    continue
                else:
                    match = re.search(r'%s' % j[k], i[7], flags=re.DOTALL)
                    if match is not None:
                        if tempIP == "":
                            tempIP = i[0]
                            matchListPerIP.append(j[4])
                        elif tempIP == i[0]:
                            matchListPerIP.append(j[4])
                        elif tempIP != i[0]:
                            fullMatchList.append([tempIP, matchListPerIP])
                            tempIP = i[0]
                            matchListPerIP = []
                            matchListPerIP.append(j[4])
                        break
# flush the last IP, but only if at least one match was ever found
if tempIP != "":
    fullMatchList.append([tempIP, matchListPerIP])

Here's an example regex search pattern from the CSV file:
(?!(.*?)\br2\b)cpe:/o:microsoft:windows_server_2008:

That pattern is intended to identify Windows Server 2008, and includes a negative lookahead to avoid matching the R2 edition.
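As a small illustration of how that lookahead behaves, here is a sketch using made-up sample strings (the strings and the names WIN2008 and WIN2008_DOTALL are hypothetical, not taken from the real data). It suggests one thing worth checking: with re.DOTALL, as used in the loop above, the `.*?` inside the lookahead can cross newlines, so an "r2" token on any later line of the multi-line string also suppresses the match.

```python
import re

# The example pattern from the CSV file, compiled with and without DOTALL.
# WIN2008 / WIN2008_DOTALL are illustrative names only.
RAW = r"(?!(.*?)\br2\b)cpe:/o:microsoft:windows_server_2008:"
WIN2008 = re.compile(RAW)
WIN2008_DOTALL = re.compile(RAW, flags=re.DOTALL)

# Hypothetical sample strings, invented for this demonstration.
plain_2008 = "OS CPE: cpe:/o:microsoft:windows_server_2008::sp1"
r2_2008 = "OS CPE: cpe:/o:microsoft:windows_server_2008:r2"
later_r2 = "OS CPE: cpe:/o:microsoft:windows_server_2008::sp2\nhostname: r2"

# The non-R2 string matches; the R2 string is rejected by the lookahead.
matches_plain = WIN2008_DOTALL.search(plain_2008) is not None
matches_r2 = WIN2008_DOTALL.search(r2_2008) is not None

# An unrelated "r2" on a later line only blocks the match when DOTALL is set,
# because "." then matches newlines inside the lookahead.
later_with_dotall = WIN2008_DOTALL.search(later_r2) is not None
later_without_dotall = WIN2008.search(later_r2) is not None
```

Whether that cross-line behavior is wanted depends on what the i[7] strings actually contain.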

I'm new to Python so any advice is appreciated! Thank you!

As you have working code and you're just looking for performance improvements, maybe this is a good question for [Code Review](http://codereview.stackexchange.com/)? — glibdud, Feb 02 '17 at 13:20
Depending on the structure of your regexes, it might be possible to compress your hosts and regexes into two trees and traverse the overlap of the trees. This requires simplifying "regex" to "string matching", but then you can traverse almost everything at the same time. — Tezra, Jun 06 '17 at 18:58

score 0 · Answer 1 · answered Mar 12 '19 at 13:27

The NIDS community has done a lot of work on testing the same string(s) (network packets) against a long list of regexes (firewall rules).

I haven't read the literature, but Coit et al.'s "Towards faster string matching for intrusion detection or exceeding the speed of Snort" appears to be a good starting point.

Quoting from the Introduction:

The basic string matching task that must be
performed by a NIDS is to match a number of patterns drawn from the NIDS rules to 
each packet or reconstructed TCP stream that the NIDS is analyzing. In Snort, the 
total number of rules available has become quite large, and continues to grow 
rapidly. As of 10/10/2000 there were 854 rules included in the “10102kany.rules” 
ruleset file [5]. 68 of these rules did not require content matching while 786 
relied on content matching to identify harmful packets. Thus, even though not 
every pattern string is applied to every stream, there are a large number of 
patterns being applied to some streams. For example, in traffic inbound to a web 
server, Snort v 1.6.3 with the snort.org ruleset, “10102kany.rules”, checks up to 
315 pattern strings against each packet. At the moment, it checks each pattern in 
turn using the Boyer-Moore algorithm. Since the patterns often have something in 
common, it seemed likely that there is considerable scope for efficiency 
improvements here, and so it has proved.
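To make the idea of exploiting what the patterns have in common more concrete, here is a minimal sketch of the classic Aho-Corasick automaton, which finds every literal from a set in a single pass over the text. This is not claimed to be the algorithm from the paper, and it only handles plain strings, not regexes. The function names (build_automaton, find_patterns) and the sample literals are hypothetical; for the question's data one would have to pick a literal fragment out of each regex by hand, and the full regex would still need to run afterwards to confirm a hit.

```python
from collections import deque


def build_automaton(patterns):
    """Build Aho-Corasick tables for a list of literal strings (illustrative helper)."""
    goto = [{}]        # goto[state][char] -> next state (a trie of the patterns)
    fail = [0]         # fail[state] -> state to fall back to on a mismatch
    output = [set()]   # output[state] -> indices of patterns that end at this state
    for index, pattern in enumerate(patterns):
        state = 0
        for char in pattern:
            if char not in goto[state]:
                goto.append({})
                fail.append(0)
                output.append(set())
                goto[state][char] = len(goto) - 1
            state = goto[state][char]
        output[state].add(index)
    # Breadth-first pass: depth-1 states keep fail = 0, deeper states get
    # their failure link from their parent's failure chain.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for char, nxt in goto[state].items():
            queue.append(nxt)
            fallback = fail[state]
            while fallback and char not in goto[fallback]:
                fallback = fail[fallback]
            fail[nxt] = goto[fallback].get(char, 0)
            # Inherit patterns that end at the fallback state (proper suffixes).
            output[nxt] |= output[fail[nxt]]
    return goto, fail, output


def find_patterns(text, automaton):
    """Return the set of pattern indices that occur anywhere in text."""
    goto, fail, output = automaton
    found = set()
    state = 0
    for char in text:
        while state and char not in goto[state]:
            state = fail[state]
        state = goto[state].get(char, 0)
        found |= output[state]
    return found


# Hypothetical literals, one per search pattern, and a made-up host string.
LITERALS = ["windows_server_2008", "windows_server_2012", "linux_kernel"]
AUTOMATON = build_automaton(LITERALS)
sample = "cpe:/o:microsoft:windows_server_2012:r2\ncpe:/o:linux:linux_kernel:3"
hits = find_patterns(sample, AUTOMATON)  # indices into LITERALS
```

The automaton is built once and reused for every host string, so the cost per string depends mainly on its length rather than on the number of patterns. Only the regexes whose literal showed up in `hits` would then need to be run against that string.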
