4

I was trying to write around 5 billion lines to a file using Python. I have noticed that the performance of the writes gets worse as the file gets bigger.

For example, at the beginning I was writing 10 million lines per second; after 3 billion lines, the writes are about 10 times slower than before.

I was wondering whether this is actually related to the size of the file.

That is, do you think the performance would improve if I broke this big file into smaller ones, or does the size of the file not affect write performance?

If you think it affects the performance, can you please explain why?

-- Some more info --

The memory consumption is the same (1.3%) all the time. The length of the lines is the same. The logic is that I read one line from a file (let's call it file A). Each line of file A contains 2 tab-separated values; if one of the values has some specific characteristics, I add the same line to file B. This operation is O(1): I just convert the value to int and check whether that value % someNumber is any of the 7 flags that I want.

Every time I read 10M lines from file A, I output the line number. (That's how I know the performance dropped.) File B is the one that gets bigger and bigger, and the writes to it get slower.

The OS is Ubuntu.
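In rough pseudo-code (someNumber, the 7 flag values, and the file names below are placeholders, not my exact code), the loop looks like this:

FLAGS = {0, 1, 2, 3, 4, 5, 6}     # placeholder: the 7 remainder values I care about
someNumber = 1000                 # placeholder modulus

with open('fileA.txt') as file_a, open('fileB.txt', 'w') as file_b:
    for i, line in enumerate(file_a, 1):
        value = int(line.split('\t')[1])   # one of the two tab-separated fields
        if value % someNumber in FLAGS:    # O(1) check against the 7 flags
            file_b.write(line)
        if i % 10000000 == 0:              # report progress every 10M lines
            print(i)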

Bahar
    How and when did you measure the write rate? Writes to disk are buffered in memory, so during a continuous stream of writes, you might notice periodic slowdowns while writes are actually committed to disk. – chepner Aug 31 '14 at 18:14
  • 1
    How fast your OS can get lines to files is not Python specific; there is nothing in the Python I/O code that'll affect performance based on filesize. – Martijn Pieters Aug 31 '14 at 18:17
  • Could your lines grow in size? E.g., if newer lines are 10 times larger, then (assuming the same MB/sec rate) the number of lines written per second should fall. – jfs Aug 31 '14 at 18:59
  • Are you sure that is not your computer's memory filling up? What does the rest of the program look like? What OS? – dawg Aug 31 '14 at 19:00
  • The memory consumption is the same (1.3%) all the time. The length of the lines is the same. The logic is that I read one line from a file (let's call it file A). Each line of file A contains 2 tab-separated values; if one of the values has some specific characteristics, I add the same line to file B. Every time I read 10M lines from file A I output the line number. (That's how I know the performance dropped.) File B is the one that gets bigger and bigger and the writes to it get slower. The OS is Ubuntu. Thanks! – Bahar Aug 31 '14 at 19:07
  • "if one of the values has some specific characteristics I add the same line to file B": could you be more specific about that test? Unless it's O(1), you could be in for a significant slowdown as time passed independent of file size. (One way to instantly check this would be to comment out the write itself: if it's test-related, you should still see the slowdown.) – DSM Aug 31 '14 at 19:14
  • Can you show us a bit of your code? – Mazdak Aug 31 '14 at 19:18
  • The operation is O(1), I just convert the value to int and check if that value % someNumber is any of the 7 flags that I want. – Bahar Aug 31 '14 at 19:24
  • Is it a FAT or EXT4 disk format? – dawg Aug 31 '14 at 19:41
  • My vote is for @chepner's explanation. The disk cache is outside of process memory. You can see it with `free -h` in the cached column. You can run `iostat /dev/ -m 5 1000` (show every 5 seconds, 1000 times - tweak as you like) to show your disk's performance. If it goes up and levels off when your program slows, you know you are disk bound. – tdelaney Aug 31 '14 at 19:45
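Expanding on chepner's and tdelaney's points: one quick way to see whether the page cache is hiding the real disk cost is to time a batch of writes that ends with an explicit flush and fsync. A sketch (the line content, path, and batch size are arbitrary, not from the question's actual code):

import os
import time

line = 'x' * 59 + '\n'                     # arbitrary fixed-length line
t0 = time.time()
with open('/tmp/flush_test.txt', 'w') as f:
    for _ in range(10000000):              # 10M lines per sample
        f.write(line)
    f.flush()                              # push Python's own buffer to the OS
    os.fsync(f.fileno())                   # force the OS to commit the data to disk
print('10M lines committed in {:.1f} secs'.format(time.time() - t0))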

3 Answers

3

With this Python script:

from __future__ import print_function
import time
import sys
import platform

if sys.version_info[0] == 2:
    range = xrange          # use the lazy range under Python 2

results = []
t1 = time.time()
t0 = t1
tgt = 5000000000            # total number of lines to write
bucket = tgt // 10          # report elapsed time every tgt/10 lines
width = len('{:,}  '.format(tgt))

with open('/tmp/disk_test.txt', 'w') as fout:
    for line in range(1, tgt + 1):
        fout.write('Line {:{w},}\n'.format(line, w=width))
        if line % bucket == 0:
            s = '{:15,}   {:10.4f} secs'.format(line, time.time() - t1)
            results.append(s)
            print(s)
            t1 = time.time()
    else:
        # for/else: runs once the loop completes; append a summary to the file
        info = [platform.system(), platform.release(), sys.version, tgt, time.time() - t0]
        s = '\n\nDone!\n{} {}\n{} \n\n{:,} lines written in {:10.3f} secs'.format(*info)
        fout.write('{}\n{}'.format(s, '\n'.join(results)))

print(s)

Under Python 2 on OS X, this prints:

    500,000,000     475.9865 secs
  1,000,000,000     484.6921 secs
  1,500,000,000     463.2881 secs
  2,000,000,000     460.7206 secs
  2,500,000,000     456.8965 secs
  3,000,000,000     455.3824 secs
  3,500,000,000     453.9447 secs
  4,000,000,000     454.0475 secs
  4,500,000,000     454.1346 secs
  5,000,000,000     454.9854 secs

Done!
Darwin 13.3.0
2.7.8 (default, Jul  2 2014, 10:14:46) 
[GCC 4.2.1 Compatible Apple LLVM 5.1 (clang-503.0.40)] 

5,000,000,000 lines written in   4614.091 secs

Under Python 3.4 on OS X:

    500,000,000     632.9973 secs
  1,000,000,000     633.0552 secs
  1,500,000,000     682.8792 secs
  2,000,000,000     743.6858 secs
  2,500,000,000     654.4257 secs
  3,000,000,000     653.4609 secs
  3,500,000,000     654.4969 secs
  4,000,000,000     652.9719 secs
  4,500,000,000     657.9033 secs
  5,000,000,000     667.0891 secs

Done!
Darwin 13.3.0
3.4.1 (default, May 19 2014, 13:10:29) 
[GCC 4.2.1 Compatible Apple LLVM 5.1 (clang-503.0.40)] 

5,000,000,000 lines written in   6632.965 secs

The resulting file is 139 GB. You can see that on a relatively empty disk (my /tmp path is a 3 TB volume) the per-bucket times stay roughly constant, i.e., the total write time grows linearly with the number of lines.

My suspicion is that under Ubuntu, you are running into the OS trying to keep that growing file contiguous on an EXT4 disk.

Recall that both OS X's HFS+ and Linux's EXT4 file systems use allocate-on-flush disk allocation schemes. Linux will also attempt to actively move files so that their allocation stays contiguous (not fragmented).

For Linux EXT4 -- you can preallocate larger files to reduce this effect. Use fallocate as shown in this SO post. Then rewind the file pointer in Python and overwrite in place.
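On Linux, one way to do that from Python (3.3+) is os.posix_fallocate (Unix-only). A rough sketch, with the size and file name as placeholders:

import os

size = 150 * 1024**3                  # placeholder: a rough upper bound for file B

# Reserve the blocks up front so EXT4 does not have to grow the file as you write.
with open('fileB.txt', 'wb') as f:
    os.posix_fallocate(f.fileno(), 0, size)

# Then reopen, write from the start, and trim the unused tail when finished.
with open('fileB.txt', 'r+') as f:
    # ... write your lines here ...
    f.truncate()                      # cut the file back to what was actually written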

You may be able to use the Python truncate method to create the file, but the results are platform dependent.

Something similar to (pseudo code):

def preallocate_file(path, size):
    ''' Preallocate a file at "path" of "size" bytes '''
    # Use truncate here, or fallocate on Linux.
    # Depending on your platform, you *may* be able to just do the following;
    # it works on BSD and OS X -- probably most *nix:
    with open(path, 'w') as f:
        f.truncate(size)


preallocate_file(fn, size)
with open(fn, 'r+') as f:
    f.seek(0)        # start at the beginning
    # write whatever
    f.truncate()     # erases the unused portion...
dawg
  • If you *really* want to be fancy -- write your own context manager that will act like the `with open(...) as f:` form but takes an additional argument to preallocate the size. Then do the truncate on exit from the context manager. Rockin'! – dawg Sep 03 '14 at 03:59
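A rough sketch of that context-manager idea (the name open_preallocated, the truncate-based preallocation, and the usage values are illustrative, not code from the comment):

from contextlib import contextmanager

@contextmanager
def open_preallocated(path, size):
    # Create the file and reserve `size` bytes up front (platform dependent).
    with open(path, 'wb') as f:
        f.truncate(size)
    f = open(path, 'r+b')
    try:
        yield f
    finally:
        f.truncate(f.tell())      # on exit, trim back to what was actually written
        f.close()

# usage
with open_preallocated('/tmp/big.txt', 10 * 1024**2) as f:
    f.write(b'some line\n')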
2

The code which can cause this is not part of Python. If you are writing to a file system type which has issues with large files, the code you need to examine is the file system driver.

For workarounds, experiment with different file systems for your platform (but then this is no longer a programming question, and hence doesn't belong on StackOverflow).

tripleee
  • The answer could be made more specific if your question mentioned which OS you are hosting the Python code on, and the type of the file system you are writing to. – tripleee Aug 31 '14 at 19:04
0

As you say, after 3 billion lines you see a sharp drop in performance, while your memory usage stays the same (1.3%) the whole time. As others have mentioned, there is nothing in the Python I/O code that will affect performance based on file size, so the slowdown is most likely caused by a software (OS) or hardware issue. To track it down, I suggest the following:

  • Use the `$ time python yourprogram.py` command to analyze your timing; it reports the following:

    real - the actual elapsed (wall-clock) time
    user - the amount of CPU time spent outside the kernel
    sys  - the amount of CPU time spent inside kernel-specific functions

    Read more about real, user and sys in THIS Stack Overflow answer by ConcernedOfTunbridgeWells.

  • Use line-by-line timing and execution-frequency profiling. line_profiler (written by Robert Kern) is an easy and unobtrusive way to profile your code and see how fast and how often each line of code in your script runs (a short usage sketch is included at the end of this answer). You can install the Python package via pip:
    $ pip install line_profiler
    

    Read the documentation HERE. You can also install memory_profiler to find out how much memory your lines use. Install it with:

    $ pip install -U memory_profiler
    $ pip install psutil
    

    Its documentation is HERE.

  • The last and most important step is to find out whether there is a memory leak. The CPython interpreter uses reference counting as its main method of keeping track of memory. This means that every object contains a counter, which is incremented when a reference to the object is stored somewhere and decremented when a reference to it is deleted. When the counter reaches zero, the CPython interpreter knows that the object is no longer in use, so it deletes the object and deallocates the occupied memory.

    A memory leak can often occur in your program if references to objects are held even though the object is no longer in use.

    The quickest way to find these “memory leaks” is to use an awesome tool called objgraph written by Marius Gedminas. This tool allows you to see the number of objects in memory and also locate all the different places in your code that hold references to these objects.

    Install objgraph with pip:

    pip install objgraph
    

    Once you have this tool installed, insert into your code a statement to invoke the debugger:

    import pdb; pdb.set_trace()
    

    Which objects are the most common?

    At run time, you can inspect the top 20 most prevalent objects in your program by running the following, which gives a result like this:

    (pdb) import objgraph
    (pdb) objgraph.show_most_common_types()
    
    MyBigFatObject             20000
    tuple                      16938
    function                   4310
    dict                       2790
    wrapper_descriptor         1181
    builtin_function_or_method 934
    weakref                    764
    list                       634
    method_descriptor          507
    getset_descriptor          451
    type                       439
    

    Read the documentation HERE.

    Sources:

    http://mg.pov.lt/objgraph/#python-object-graphs

    https://pypi.python.org/pypi/objgraph

    http://www.appneta.com/blog/line-profiler-python/

    https://sublime.wbond.net/packages/LineProfiler

    http://www.huyng.com/posts/python-performance-analysis/

    What do 'real', 'user' and 'sys' mean in the output of time(1)?
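For reference, a minimal usage sketch for line_profiler and memory_profiler (the function and file names below are just examples, not taken from the question):

# yourprogram.py
# Line-by-line timings:   kernprof -l -v yourprogram.py
# Per-line memory usage:  python -m memory_profiler yourprogram.py
# In both cases the `profile` decorator is injected at run time, so run the
# script through one of the two commands above rather than plain python.

@profile
def copy_matching_lines(src, dst):
    # example hot loop to profile; substitute your real filtering logic
    with open(src) as file_a, open(dst, 'w') as file_b:
        for line in file_a:
            file_b.write(line)

if __name__ == '__main__':
    copy_matching_lines('fileA.txt', 'fileB.txt')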

Mazdak