
I have the following code that tries to process a huge file with multiple xml elements.

from shutil import copyfile
files_with_companies_mentions=[]
# code that reads the file line by line
def read_the_file(file_to_read):
    list_of_files_to_keep=[]
    f = open(file_to_read,'r')   # use the argument instead of a hard-coded 'huge_file.nml'
    lines=f.readlines()          # note: this loads the entire file into memory
    f.close()
    print("2. I GET HERE ")
    for i in range(0,len(lines)):
        if '<?xml version="1.0"' in lines[i]:
            write_f = open('temp_files/myfile_'+str(i)+'.nml', 'w')
            write_f.write(lines[i])
            j=i+1                # start copying from the line after the declaration
            next_line = lines[j]
            while '</doc>' not in next_line:
                write_f.write(next_line)
                j=j+1
                next_line = lines[j]
            write_f.write(next_line)
            write_f.close()
            list_of_files_to_keep.append(write_f.name)
    return list_of_files_to_keep

The file is over 700 MB, with over 20 million lines. Is there a better way to handle it?

As you can see, I need to reference the previous and next lines with an index variable such as i.

The problem I am facing is that it is very slow. It takes more than an hour per file, and I have multiple of these files.

adrCoder
  • what is the problem you are facing? disk space? – splinter Apr 19 '17 at 13:21
  • It is very slow. I edited my original post. – adrCoder Apr 19 '17 at 13:22
  • How about parallel processing to work on several of these files at the same time? – splinter Apr 19 '17 at 13:23
  • You can do 'for line in f:' and just step through the lines one at a time on demand without reading them all into memory first. You'd need to rework the logic that looks for '</doc>' by setting a boolean flag to indicate whether you're inside a document or not, though (see the sketch after these comments). – Simon Hibbs Apr 19 '17 at 13:23
  • can you give me an answer with code so that I can try it? – adrCoder Apr 19 '17 at 13:23
  • Related: http://stackoverflow.com/questions/30294146/python-fastest-way-to-process-large-file From the link, this is the first thing you should do to improve your code: `with open() as infile: for line in infile:` – turnip Apr 19 '17 at 13:23
  • You could try reading it in lazily so it's not all in memory at once. If your computer's getting short on memory, that could be causing it to slow down. – Carcigenicate Apr 19 '17 at 13:24
  • @adrCoder I added a code template to show you how to parallelize. You can do that, perhaps in conjunction with other suggestions from the comments about the file I/O. – splinter Apr 19 '17 at 13:31
  • Reading the whole file into memory and then jumping around that memory isn't very efficient. – ForceBru Apr 19 '17 at 13:32
  • Looks like you are looping over the whole file more than once. `for i in range(0,len(lines)):` and `while '</doc>' not in next_line:` – Kind Stranger Apr 19 '17 at 13:33
  • Can you guys give me a proper answer below with example of code I could use? – adrCoder Apr 19 '17 at 13:42
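
A minimal sketch of the single-pass approach suggested in the comments above, assuming (as in the question's code) that each document starts with an '<?xml version="1.0"' line and ends with a '</doc>' line; the output files are numbered sequentially here instead of by line index:

    def split_documents(file_to_read):
        list_of_files_to_keep = []
        write_f = None            # output handle for the document currently being copied
        doc_count = 0
        with open(file_to_read, 'r') as f:
            for line in f:        # streams the file instead of loading it all with readlines()
                if '<?xml version="1.0"' in line:
                    doc_count += 1
                    write_f = open('temp_files/myfile_' + str(doc_count) + '.nml', 'w')
                if write_f is not None:
                    write_f.write(line)
                    if '</doc>' in line:
                        write_f.close()
                        list_of_files_to_keep.append(write_f.name)
                        write_f = None
        return list_of_files_to_keep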

3 Answers


You can use parallel processing to speed this up, using the joblib package. Assuming you have a list of file paths called files, the structure would be as follows:

import ...
from joblib import Parallel, delayed

def read_the_file(file):
    ...

if __name__ == '__main__':

    n = 8 # number of processors
    Parallel(n_jobs=n)(delayed(read_the_file)(file) for file in files)
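
In case it helps, one way such a files list might be built (the input_files/*.nml pattern is only a placeholder for wherever your large files actually live):

    import glob

    files = glob.glob('input_files/*.nml')  # placeholder pattern; point this at your input files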
splinter

First of all, you shouldn't determine the total number of lines yourself or read the whole file at once if you don't need to. Use a loop like the one sketched below and you'll already save some time. Also consider this post on the usage of readlines(): http://stupidpythonideas.blogspot.de/2013/06/readlines-considered-silly.html.
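
A minimal sketch of the kind of loop meant here, reusing huge_file.nml from the question (process is just a placeholder for your per-line logic):

    with open('huge_file.nml', 'r') as f:
        for line in f:       # streams one line at a time instead of readlines()
            process(line)    # placeholder for whatever you do with each line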

Since you're working with XML elements, consider using a library that makes this easier, especially for the writing.

lpoorthuis
  1. suggestion: make use of a context manager:

    with open(filename, 'r') as file:
        ...
    
  2. suggestion: do the reading and processing chunk-wise (currently you read the whole file in a single step and only afterwards go over the resulting list "line by line"):

    for chunk in iter(lambda: file.read(number_of_bytes_to_read), ''):
        my_function(chunk)
    

Of course, this way you have to look out for XML tags and document boundaries that start in one chunk and end in the next.
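
A rough sketch of that kind of boundary handling, assuming (as in the question) that '</doc>' closes each document; handle_document and the chunk size are placeholders:

    buffer = ''
    with open('huge_file.nml', 'r') as file:
        while True:
            chunk = file.read(64 * 1024)             # read 64 KB at a time
            if not chunk:
                break
            buffer += chunk
            while '</doc>' in buffer:
                doc, buffer = buffer.split('</doc>', 1)
                handle_document(doc + '</doc>')      # placeholder for processing one complete document
    # anything left in buffer afterwards is an incomplete trailing fragment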

Alternative: look for an XML parser package. I am quite certain there is one that can process files chunk-wise, with correct tag handling included.
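
For instance, the standard library's xml.etree.ElementTree.iterparse reads incrementally. A sketch of applying it to one of the per-document files produced by the splitting step (the tag name 'company' and the file name are purely illustrative; this assumes each extracted file is a single well-formed document):

    import xml.etree.ElementTree as ET

    for event, elem in ET.iterparse('temp_files/myfile_1.nml', events=('end',)):
        if elem.tag == 'company':    # illustrative tag name; depends on your actual schema
            pass                     # inspect elem.text / elem.attrib here
        elem.clear()                 # discard processed elements to keep memory use flat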

akoeltringer