I have the following code that tries to split a huge file containing multiple XML documents.
from shutil import copyfile

files_with_companies_mentions = []

# Read the whole file into memory, then write each XML document
# (from '<?xml version="1.0"' down to '</doc>') into its own temp file.
def read_the_file(file_to_read):
    list_of_files_to_keep = []
    f = open(file_to_read, 'r')  # open the file passed as the argument
    lines = f.readlines()
    f.close()
    print("2. I GET HERE ")
    for i in range(len(lines)):
        if '<?xml version="1.0"' in lines[i]:
            j = i + 1  # start at the line after the XML declaration
            next_line = lines[j]
            write_f = open('temp_files/myfile_' + str(i) + '.nml', 'w')
            write_f.write(lines[i])
            while '</doc>' not in next_line:
                write_f.write(next_line)
                j = j + 1
                next_line = lines[j]
            write_f.write(next_line)  # write the closing '</doc>' line
            write_f.close()
            list_of_files_to_keep.append(write_f.name)
    return list_of_files_to_keep
The file is over 700 MB, with more than 20 million lines. Is there a better way to handle it?
As you can see, I need to reference the previous and next lines through an index variable such as i.
The problem I am facing is that this is very slow: it takes more than an hour per file, and I have several of these files to process.
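One likely cause of the slowness is `readlines()`, which loads all 20 million lines into a Python list before any work starts, plus the quadratic-looking index scanning. A streaming rewrite that reads the file line by line and writes each document as it is encountered avoids both. This is a sketch under the same assumptions as the code above (each document starts at a line containing `<?xml version="1.0"` and ends at a line containing `</doc>`); the function name `split_documents` and the output naming scheme are my own, illustrative choices:

```python
import os

def split_documents(file_to_read, out_dir='temp_files'):
    """Stream the input line by line, writing each
    <?xml ...> ... </doc> document to its own file,
    without loading the whole input into memory."""
    os.makedirs(out_dir, exist_ok=True)
    list_of_files_to_keep = []
    out = None       # currently open output file, if inside a document
    count = 0        # running document counter used in the file name
    with open(file_to_read, 'r') as f:
        for line in f:  # lazy iteration: one line in memory at a time
            if '<?xml version="1.0"' in line:
                # Start of a new document: open a fresh output file.
                out = open(os.path.join(out_dir, 'myfile_%d.nml' % count), 'w')
                count += 1
            if out is not None:
                out.write(line)
                if '</doc>' in line:
                    out.close()
                    list_of_files_to_keep.append(out.name)
                    out = None  # back to "between documents" state
    return list_of_files_to_keep
```

This makes a single pass over the file with constant memory use, so the runtime is dominated by disk I/O rather than list indexing. It also tolerates junk lines between documents, since nothing is written unless an output file is open.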