I have a 25GB file I need to process. Here is what I'm currently doing, but it takes an extremely long time to open:

collection_pricing = os.path.join(pricing_directory, 'collection_price')
with open(collection_pricing, 'r') as f:
    collection_contents = f.readlines()

length_of_file = len(collection_contents)

for num, line in enumerate(collection_contents):
    print '%s / %s' % (num+1, length_of_file)
    cursor.execute(...)

How could I improve this?

David542
  • Depends on what exactly you want to do with file content. Current code just prints the line numbers. – Ashwini Chaudhary Sep 16 '14 at 22:18
  • What you're showing right now is essentially a (very expensive) no-op. What is it that you actually do with the lines of the file? – NPE Sep 16 '14 at 22:19
  • If your processing is stateless (i.e. what is on one line doesn't affect the next, and you're just parsing the data and/or putting it somewhere else), open the file and process it one line at a time via a buffer. If earlier lines do matter (for example, what's on line 30 matters when you're on line 1000000), then do the same but store only the important bits. – Mike McMahon Sep 16 '14 at 22:20
  • Related: [Processing Large Files in Python(1000 GB or More)](http://stackoverflow.com/q/23765360/846892) – Ashwini Chaudhary Sep 16 '14 at 22:23

3 Answers

  1. Unless the lines in your file are really, really big, do not print the progress at every line. Printing to a terminal is very slow. Print progress e.g. every 100 or every 1000 lines.

  2. Use the available operating system facilities to get the size of a file: os.path.getsize(), see "Getting file size in Python?"

  3. Get rid of readlines() to avoid reading 25GB into memory. Instead, read and process the file line by line, see e.g. "How to read large file, line by line in python"

nos
  • How would I then see the progress of the script if I'm not using `readlines`? – David542 Sep 16 '14 at 22:33
  • @David542 Same as you do now. See the first answer of the last link, there's a loop there too, which is where you do your work. e.g. print the progress. – nos Sep 16 '14 at 22:41
  • How would I get the number of lines in my file then? – David542 Sep 16 '14 at 22:43
  • @David542 Then you'd read the file twice. However if you only need to know this to print the progress of your application, just use the file size and count how many bytes you've read instead of lines. – nos Sep 16 '14 at 22:48
  • Got it, thanks. I can do `os.path.getsize()` to get the total size of the file, how do I get 'how far I am' along in the `for line in file` ? – David542 Sep 16 '14 at 23:35

Pass through the file twice: Once to count lines, once to do the printing. Never call readlines on a file that size -- you'll end up swapping everything to disk. (Actually, just never call readlines in general. It's silly.)

(Incidentally, I'm assuming that you're actually doing something with the lines, rather than just the number of lines -- the code you posted there doesn't actually use anything from the file other than the number of newlines in it.)
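
For what it's worth, the two-pass idea could look roughly like the sketch below. It is only an illustration: `count_lines`, `process_with_line_progress`, `handle_line` and `report_every` are names made up here, and `handle_line` stands in for whatever you actually do with each line (the `cursor.execute(...)` call in the question). The print call is written so that it runs on both Python 2 and Python 3.

```python
def count_lines(path):
    """First pass: count the lines without keeping any of them in memory."""
    count = 0
    with open(path, 'r') as f:
        for _ in f:
            count += 1
    return count


def process_with_line_progress(path, handle_line, report_every=10000):
    """Second pass: handle each line and print 'done / total' now and then."""
    total = count_lines(path)
    done = 0
    with open(path, 'r') as f:
        for line in f:
            handle_line(line)  # hypothetical callback doing the real work
            done += 1
            # Report every `report_every` lines, plus once at the very end.
            if done % report_every == 0 or done == total:
                print('%s / %s' % (done, total))
    return done
```

The cost is that the whole file is read from disk twice, which is the price of knowing the exact line count up front.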

Sneftel

Combining the answers above, here is how I modified it.

import os

size_of_file = os.path.getsize(collection_pricing)  # total size in bytes
progress = 0    # characters consumed so far
line_count = 0

with open(collection_pricing, 'r') as f:
    for line in f:  # the file object yields one line at a time
        line_count += 1
        progress += len(line)
        if line_count % 10000 == 0:
            print '%s / %s' % (progress, size_of_file)

This has the following improvements:

  • Doesn't use readlines(), so the whole file is never held in memory at once
  • Prints progress only every 10,000 lines
  • Measures progress by file size instead of line count, so the file doesn't have to be read twice.
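
One caveat with counting `len(line)` in text mode: if newlines are translated (e.g. `\r\n` on Windows) or, on Python 3, the text contains multi-byte characters, the running total can drift from the byte size that `os.path.getsize()` reports. A possible variation, sketched below, opens the file in binary mode so both numbers are in bytes. The names `process_with_byte_progress`, `handle_line` and `report_every` are made up for illustration, and on Python 3 `handle_line` would receive `bytes` and need to decode them itself.

```python
import os


def process_with_byte_progress(path, handle_line, report_every=10000):
    """Stream `path` line by line, printing the share of bytes consumed."""
    total_bytes = os.path.getsize(path)
    bytes_read = 0
    line_count = 0
    # Binary mode keeps each line's raw bytes (including any '\r\n'),
    # so len(line) adds up to exactly what os.path.getsize() counted.
    with open(path, 'rb') as f:
        for line in f:
            handle_line(line)  # hypothetical callback doing the real work
            line_count += 1
            bytes_read += len(line)
            if line_count % report_every == 0:
                percent = 100.0 * bytes_read / total_bytes
                print('%.1f%% (%s / %s bytes)' % (percent, bytes_read, total_bytes))
    return bytes_read, total_bytes
```

When the loop finishes, the two returned values should be equal, which makes for an easy sanity check on a small file before running it against the 25GB one.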
David542