I'm writing some Python that involves searching a memory dump for potential URLs. The program seems to work fine for me, but a user who tested it said that it hung on this portion of code for at least two days:
import re

#carve urls
print "\nCarving potential URLs from correlated strings."
# URL-matching pattern found online: matches http(s)://, www., or bare domain forms
pattern = re.compile(ur'(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:\'".,<>?\xab\xbb\u201c\u201d\u2018\u2019]))')
with open(outputPath+"\\strings\\correlatedstrings.txt", "r") as fstring:
    with createFile(outputPath+"\\strings", "urlSearch") as f:
        for stringline in fstring:
            if pattern.search(stringline):
                f.write(stringline)
The regex is one I found somewhere online (probably here on SO), and it seems to work extremely well for me. One potential problem: the memory dump this user is searching is a whopping 32 GB. Are there issues with using regex, or with my code, on extremely large files? Any thoughts would be very helpful :).
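In case it helps, here's the mitigation I've been considering, assuming the hang comes from pathologically long lines (strings carved from a raw dump can run for megabytes without a newline, which could make a backtracking regex crawl). This is just a minimal sketch: MAX_LINE, the simplified pattern, and the file paths below are placeholders I made up for illustration, not part of my real program.

import re

MAX_LINE = 4096  # assumed cap; legitimate URLs are rarely longer than this

# deliberately simplified pattern for the sketch, not my real one
pattern = re.compile(ur'(?i)\bhttps?://[^\s<>"]+')

with open("correlatedstrings.txt", "r") as fstring:
    with open("urlSearch.txt", "w") as f:
        for stringline in fstring:
            # truncating bounds the worst-case time the regex can spend on one line
            if len(stringline) > MAX_LINE:
                stringline = stringline[:MAX_LINE]
            if pattern.search(stringline):
                f.write(stringline)

Would something like that be a sane safeguard, or is the real issue elsewhere?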