Python newbie here. I am trying to search through a large file and extract text. The actual data I need is the value inside the parentheses: the lines look like xx(bb) xx(bb), where xx can be any combination of digits and bb can be any characters, including numbers, so a line can contain 1(YJ) 2(*). The end goal is to compare the characters inside the parentheses with the values in a set: {'hO', 'Ih', 'Dn', '8', 'MF', 'dC', '6', 'RE', 'WM', 'Dh', '5'}. So for that example I would be checking whether YJ and * are in the set.
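To make the goal concrete, here is roughly what I want to end up with for a single line (the capture-group pattern is just my guess at how the extraction could look):

import re

valid = {'hO', 'Ih', 'Dn', '8', 'MF', 'dC', '6', 'RE', 'WM', 'Dh', '5'}

line = '1(YJ) 2(*)'                              # example line from the file
values = re.findall(r'\d+\(([^)]*)\)', line)     # -> ['YJ', '*']
print([v in valid for v in values])              # -> [False, False]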
To this end, I have written a few methods to parse through the giant file. The issue is that this takes a long time: for a file around 1.5 GB it takes 49 seconds, and for files bigger than 5 GB it takes 5 minutes to search.
Method 1 works and prints out the line, but like all the methods here, it is slow:
import re

with open(filename, 'rb', buffering=102400) as f:
    time_data_count = 0
    for line in f:
        # if b'(X,Y)' in line:
        #     print(line)
        if re.search(rb'\d+\(.+\)\s+\d+\(.+\)', line):
            print(line)
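For what it is worth, once a line matches, the plan is to pull the parenthesized values out of it and test them against the set, along the lines of this sketch (the helper name and the capture-group pattern are just illustrations, not code I have working yet):

import re

wanted = {'hO', 'Ih', 'Dn', '8', 'MF', 'dC', '6', 'RE', 'WM', 'Dh', '5'}

def check_line(line):
    # line is one of the matching bytes lines printed by Method 1
    for value in re.findall(rb'\(([^)]*)\)', line):
        if value.decode() in wanted:   # decode bytes before the set lookup
            print(value)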
Method 2: another issue with this method is that it always returns None. Why?
with open(filename, 'rb') as f:
    #text = []
    while True:
        memcap = f.read(102400)
        if not memcap:
            break
        text = re.search(rb'\d+\(.+\)\s+\d+\(.+\)', memcap)
        if text is None:
            print("none")
Method 3: this one only prints a list of 1 element unless the file is below 1 GB. Why is this?
with open(filename, 'rb') as f:
    time_data_count = 0
    text = []
    while True:
        memcap = f.read(102400)
        if not memcap:
            break
        text = re.findall(rb'\d+\(.+\)\s+\d+\(.+\)', memcap)
    print(text)
So those are the three methods I have written. Only one works as it should, but they all share the same issue of being slow. Is Python regex just slow in general? Is there a different way to get the kind of values I need without using regex? I thought reading in binary mode with a file buffer would help, but this is as fast as it gets. Please help.