Finding sub-strings in LARGE string

Question

#read in csv file in form ("case, num, val \n case1, 1, baz\n...")
# convert to form FOO = "casenumval..." roughly 6 million characters
for someString in List: #60,000 substrings
    if substr not in FOO:
        #do stuff
    else: 
        #do other stuff

So my issue is that there are far too many sub strings to check against this massive string. I have tried reading the file in line by line and checking the substrings against the line, but this still crashes the program. Are there any techniques for checking a lot of substrings againsts a very large string efficiently?

FOR CONTEXT: I am performing data checks, suspect data is saved to a csv file to be reviewed/changed. This reviewed/changed file is then compared to the original file. Data which has not changed has been verified as good and must be saved to a new "exceptionFile". Data that has been changed and passes is disregarded. And data which has been changed and is checked and still suspect is the sent off for review again.

If `otherString` is actually a string, the loop would iterate over _individual characters_, not substrings. — ForceBru, Jun 30 '17 at 19:04
Did you read this question? https://stackoverflow.com/questions/1765579/fast-algorithm-for-searching-for-substrings-in-a-string — Idos, Jun 30 '17 at 19:05
It would help if you told us what "do stuff" and "do other stuff" need to know. For instance, does it matter _which_ of the substrings was found, or are you just looking for _any_ of them? — zwol, Jun 30 '17 at 19:05
You say "crashes the program." Can you describe what you mean, specifically? I can see why this would be slow, but not why it would fail. — Ned Batchelder, Jun 30 '17 at 19:05
So basically you want to check présence of 60000 needles in a 6M haystack? — xtofl, Jun 30 '17 at 19:07
@NedBatchelder I am running my script in PSSE, an application for power systems and it tends to not be able to handle large processes. — alexjones, Jun 30 '17 at 19:07
@downshift it makes no difference. Read the question/answers. — Idos, Jun 30 '17 at 19:07
@zwol I am basically adding the string to either a found or not found list — alexjones, Jun 30 '17 at 19:08
@ForceBru I was unclear at first. I am looping through strings in a list — alexjones, Jun 30 '17 at 19:09
@xtofl basically, yes. I am trying to find a way to compare these strings at earlier times so that I do not have to search through, let alone create the 6M haystack. Rather look for a needle in a manageable haystack many times. — alexjones, Jun 30 '17 at 19:11
@Idos I read this and my problems differs because I must keep track of every string which is found or not found. — alexjones, Jun 30 '17 at 19:18
Your "for context" statement makes me think you might _really_ be looking for the shell utilities `diff` or `comm`. — zwol, Jun 30 '17 at 19:21
@zwol I had not thought of using diff for this. that may be simpler than the current suggested answer. — alexjones, Jun 30 '17 at 19:24

score 2 · Answer 1 · answered Jun 30 '17 at 19:19

The first thing you should do is convert your list of 60,000 strings to search for into one big regular expression:

import re
searcher = re.compile("|".join(re.escape(s) for s in List)

Now you can search for them all at once:

for m in searcher.finditer(FOO):
    print(m.group(0))  # prints the substring that matched

If all you care about is knowing which ones were found,

print(set(m.group(0) for m in searcher.finditer(FOO))

This is still doing substantially more work than the absolute minimum, but it should be much more efficient than what you were doing before.

Also, if you know that your input is a CSV file and you also know that none of the strings-to-search-for contain a newline, you can operate line by line, which may or may not be faster than what you were doing depending on conditions, but will certainly use less memory:

with open("foo.csv") as FOO:
    for line in FOO:
        for m in searcher.finditer(line):
            # do something with the substring that matched

Thank you for your answer, I will test this immediately. – alexjones Jun 30 '17 at 19:20 — alexjones, Jun 30 '17 at 19:20

Finding sub-strings in LARGE string

1 Answers1