I am a new programmer working on a graduate English project in which we are trying to parse a gigantic dictionary text file (500 MB). The file is set up with HTML-like tags. I have 179 author tags, e.g. "<A>Shakes.</A>" for Shakespeare, and what I need to do is find each occurrence of every tag and then write that tag and what follows it until I get to "</W>".
My problem is that readlines() gives me a MemoryError (I assume because the file is so large). I have been able to find a match, but only the first one, and I have not been able to get the search to look past it. Any help would be greatly appreciated.
There are no newlines in the text file, which I think is causing the problem, since readlines() then treats the whole file as a single line. Update: this problem has been solved. I thought I would include the code that worked:
with open('/Users/Desktop/Poetrylist.txt', 'w') as output_file:
    with open('/Users/Desktop/2e.txt', 'r') as open_file:
        # The file has no newlines, so read it as one big string.
        the_whole_file = open_file.read()
        start_position = 0
        while True:
            # Find the next author tag, searching from where we left off.
            start_position = the_whole_file.find('<A>', start_position)
            if start_position < 0:
                break
            start_position += 3  # skip past '<A>'
            end_position = the_whole_file.find('</W>', start_position)
            if end_position < 0:
                break  # no closing '</W>' left, so stop
            output_file.write(the_whole_file[start_position:end_position])
            output_file.write("\n")
            start_position = end_position + 4  # skip past '</W>'
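
If read() itself ever runs out of memory on a file this size, the same search could be done in chunks rather than on the whole file at once. This is only a rough sketch, not what I actually ran: the function name extract_entries, its parameters, and the 1 MB chunk size are made up for illustration, and it assumes the same '<A>' and '</W>' tags as above.

```python
import io


def extract_entries(source, sink, start_tag="<A>", end_tag="</W>", chunk_size=1 << 20):
    """Write the text between each start_tag and the next end_tag to sink, one per line.

    Hypothetical helper: reads source in chunks so the whole file never
    has to sit in memory. Returns the number of entries written.
    """
    buffer = ""
    count = 0
    while True:
        chunk = source.read(chunk_size)
        buffer += chunk
        pos = 0
        while True:
            start = buffer.find(start_tag, pos)
            if start < 0:
                # No start tag left: keep a short tail in case a tag is split across chunks.
                pos = max(pos, len(buffer) - (len(start_tag) - 1))
                break
            end = buffer.find(end_tag, start + len(start_tag))
            if end < 0:
                # Entry is incomplete: keep it from its start tag and read more.
                pos = start
                break
            sink.write(buffer[start + len(start_tag):end])
            sink.write("\n")
            count += 1
            pos = end + len(end_tag)
        buffer = buffer[pos:]
        if not chunk:
            break  # end of input; any unfinished entry is dropped
    return count


if __name__ == "__main__":
    # Tiny in-memory demo; a tiny chunk_size forces tags to be split across chunks.
    demo_in = io.StringIO("xx<A>Shakes.</A> one</W>yy<A>Milton</A> two</W>")
    demo_out = io.StringIO()
    extract_entries(demo_in, demo_out, chunk_size=4)
    print(demo_out.getvalue(), end="")
```

With real files, the two open() calls from the code above would be passed in place of the StringIO objects, e.g. extract_entries(open_file, output_file).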