I'm looking to implement a few lines of Python, using re, to first manipulate a string and then use that string as a regex search. I have strings with *'s in the middle of them, e.g. ab***cd, with the run of *'s being any length. The aim is to run the regex search over a document and extract any lines that match the starting and finishing characters, with any number of characters in between. For example, ab12345cd, abbbcd, and ab_fghfghfghcd would all be positive matches. Examples of negative matches: 1abcd, agcd, bb111cd.
I have come up with the regex [\s\S]*? to substitute for the *'s. So I want to get from an example string of ab***cd to ^ab[\s\S]*?cd, which I will then use for a regex search of a document.
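To make the intended transformation concrete, here is a minimal sketch that checks it against the example strings above. The helper name wildcard_to_regex is hypothetical, and escaping the literal pieces with re.escape is an added assumption (so that characters such as "." in the input are treated literally); the code further down uses re.sub instead.

```python
import re

def wildcard_to_regex(raw_str):
    # Hypothetical helper: each run of '*' becomes one lazy wildcard.
    # Splitting on the runs lets the literal pieces be escaped separately.
    literal_parts = re.split(r'\*+', raw_str)
    escaped = [re.escape(part) for part in literal_parts]
    return '^' + r'[\s\S]*?'.join(escaped)

pattern = wildcard_to_regex('ab***cd')
print(pattern)

positives = ['ab12345cd', 'abbbcd', 'ab_fghfghfghcd']
negatives = ['1abcd', 'agcd', 'bb111cd']
for line in positives + negatives:
    # re.match only matches at the start of the string.
    print(line, bool(re.match(pattern, line)))
```

Note that this pattern is only anchored at the start; it does not require the line to end in cd, and [\s\S] also matches newline characters.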
I then wanted to open the file with mmap, search through it using the regex, and save the matches to a file.
import re
import mmap

def file_len(fname):
    with open(fname) as f:
        for i, l in enumerate(f):
            pass
    return i + 1

def searchFile(list_txt, raw_str):
    search="^"+raw_str #add regex ^ newline operator
    search_rgx=re.sub(r'\*+',r'[\\s\\S]*?',search) #replace * with regex function

    #search file
    with open(list_txt, 'r+') as f:
        data = mmap.mmap(f.fileno(), 0)
        results = re.findall(bytes(search_rgx,encoding="utf-8"),data, re.MULTILINE)

    #save results
    f1 = open('results.txt', 'w+b')
    results_bin = b'\n'.join(results)
    f1.write(results_bin)
    f1.close()

    print("Found "+str(file_len("results.txt"))+" results")

searchFile("largelist.txt","ab**cd")
This works fine with a small file. However, when the file gets larger (1 GB of text) I get this error:
Traceback (most recent call last):
  File "c:\Programming\test.py", line 27, in <module>
    searchFile("largelist.txt","ab**cd")
  File "c:\Programming\test.py", line 21, in searchFile
    results_bin = b'\n'.join(results)
MemoryError
First, can anyone help optimize the code? Am I doing something seriously wrong? I used mmap because I knew I would be working with large files and wanted to read the file line by line rather than all at once (which is why someone suggested mmap).
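For comparison, a minimal sketch of a streaming variant, under some assumptions. re.findall builds the complete list of matches in memory and b'\n'.join then copies them again, which is a plausible source of the MemoryError on a 1 GB file. In addition, [\s\S] matches newlines, so a single match can run across many lines. The sketch below uses re.finditer and writes each match as it is found. The function name search_file_streaming, the out_path parameter, and the line-bounded pattern in the usage comment are all hypothetical, not part of the code above.

```python
import mmap
import re

def search_file_streaming(list_txt, search_rgx, out_path='results.txt'):
    # Hypothetical variant of searchFile; search_rgx is a bytes pattern.
    # Returns the number of matches written to out_path.
    pattern = re.compile(search_rgx, re.MULTILINE)
    count = 0
    with open(list_txt, 'rb') as f, open(out_path, 'wb') as out:
        # ACCESS_READ maps the file read-only, so opening with 'rb' suffices.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as data:
            # finditer yields one match at a time instead of building a list.
            for match in pattern.finditer(data):
                out.write(match.group(0) + b'\n')
                count += 1
    return count

# Assumed usage: [^\n]*? cannot cross a newline, unlike [\s\S]*?.
# found = search_file_streaming('largelist.txt', rb'^ab[^\n]*?cd')
# print("Found " + str(found) + " results")
```

Counting matches while writing also avoids re-reading results.txt just to count its lines. The mapped file must not be empty, since mmap cannot map a zero-length file.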
I've also been told to have a look at the pandas library for more data manipulation. Would pandas replace mmap?
Thanks for any help. I'm pretty new to Python, as you can tell, so I appreciate any help.