
I'm writing some Python that involves searching a memory dump for potential URLs. The program seems to work fine for me, but a user who tested it said that it hung on this portion of code for at least two days:

import re

#carve urls
print "\nCarving potential URLs from correlated strings."
pattern = re.compile(ur'(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:\'".,<>?\xab\xbb\u201c\u201d\u2018\u2019]))')
with open(outputPath + "\\strings\\correlatedstrings.txt", "r") as fstring:
    with createFile(outputPath + "\\strings", "urlSearch") as f:
        for stringline in fstring:
            # copy any line that contains a URL-like match to the output file
            if pattern.search(stringline):
                f.write(stringline)

The regex is one I found somewhere online (probably here on SO). It seems to work extremely well for me. A potential problem here is that the memory dump being searched through by this user is a whopping 32GB. Are there issues with using regex or my code on extremely large files? Any thoughts would be very helpful :).
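(Editor's note, not from the original post: a likely culprit here is catastrophic backtracking. The nested quantifiers in the pattern above can take a very long time on long lines that almost, but never, match, and `strings` output from a 32 GB dump can easily contain such lines. Below is a minimal defensive sketch, using a deliberately simplified URL pattern and a hypothetical `MAX_LINE` cap; both names are illustrative, not from the original code.)

```python
import re

# Illustrative, much simpler URL pattern; the original post's nested-quantifier
# pattern can backtrack badly on long lines that almost-but-never match.
pattern = re.compile(r'(?i)\b(?:https?://|www\.)[^\s<>"]+')

MAX_LINE = 4096  # hypothetical cap; strings output from a dump can emit huge lines


def carve_urls(lines):
    """Yield lines that appear to contain a URL, truncating pathological ones."""
    for line in lines:
        if len(line) > MAX_LINE:
            line = line[:MAX_LINE]  # bound the regex work done per line
        if pattern.search(line):
            yield line


# usage: only the middle line contains a URL; the huge line is truncated and skipped
sample = ["foo bar\n", "visit http://example.com/page now\n", "A" * 100000 + "\n"]
found = list(carve_urls(sample))
```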

Tom
  • Are you trying to search for multiple strings in one file using regex, rather than finding exact strings in the file? Also, since your file is 32 GB, you have to read it chunk by chunk and search each chunk for the pattern. See http://stackoverflow.com/questions/519633/lazy-method-for-reading-big-file-in-python and http://stackoverflow.com/questions/6335839/python-how-to-read-n-number-of-lines-at-a-time?lq=1 – Tanmaya Meher Jul 30 '14 at 06:09
  • I should have specified, but I am not operating directly on the memory dump. I am operating on a text file that is already split into lines, generated by the strings program in Windows (with a little added information provided by Volatility). The link you posted, Tanmaya, seems to indicate that if a large file is already separated into lines, then operating on it line by line in the fashion I posted is correct. Also, I am just trying to copy any line from the text file that contains a URL to another file. – Tom Jul 30 '14 at 06:36
  • So you are already working on a text file; the size was small when you worked on it, but now the user has a much larger file? If you are reading a text file this large, you will have to employ some divide-and-conquer strategy, such as reading and searching chunk by chunk. Parallel processing of several chunks at a time (e.g. using the `multiprocessing` module) might also help a bit. At present I don't see any other option. – Tanmaya Meher Jul 30 '14 at 06:55
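(Editor's note: the chunk-by-chunk idea from the linked SO questions can be sketched as follows. `iter_lines_chunked` is a hypothetical helper, not from the original post; it reads fixed-size chunks and carries the partial last line over to the next chunk, so lines are never split in the middle.)

```python
import io


def iter_lines_chunked(f, chunk_size=1 << 20):
    """Read a file in fixed-size chunks and re-split into lines.

    The last piece of each chunk may be an incomplete line, so it is
    carried over and prepended to the next chunk before splitting.
    """
    leftover = ''
    while True:
        chunk = f.read(chunk_size)
        if not chunk:
            if leftover:
                yield leftover  # file did not end with a newline
            return
        chunk = leftover + chunk
        lines = chunk.split('\n')
        leftover = lines.pop()  # possibly incomplete final line
        for line in lines:
            yield line


# usage: a tiny chunk size shows that lines spanning chunks are reassembled
buf = io.StringIO('a\nhttp://example.com\nb')
lines = list(iter_lines_chunked(buf, chunk_size=4))
```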

0 Answers