22

I've profiled some legacy code I've inherited with cProfile. I've already made a number of changes that have helped (like using simplejson's C extensions!).

Basically this script is exporting data from one system to an ASCII fixed-width file. Each row is a record with many values. Each line is 7158 characters long and contains a ton of spaces. The total record count is 1.5 million. Rows are generated one at a time, and that generation is slow (5-10 rows a second).

As each row is generated it's written to disk as simply as possible. The profiling indicates that about 19-20% of the total time is spent in file.write(). For a test case of 1,500 rows that's 20 seconds. I'd like to reduce that number.

Now it seems the next win will be reducing the amount of time spent writing to disk. I can keep a cache of records in memory, but I can't wait until the end and dump it all at once.

import sys
from datetime import datetime

fd = open(data_file, 'w')
for c, (recordid, values) in enumerate(generatevalues()):
    row = prep_row(recordid, values)
    fd.write(row)
    if c % 117 == 0:
        if limit > 0 and c >= limit:
            break
        # status report goes to the console, not the data file
        sys.stdout.write('\r%s @ %s' % (str(c + 1).rjust(7), datetime.now()))
        sys.stdout.flush()

My first thought would be to keep a cache of records in a list and write them out in batches. Would that be faster? Something like:

rows = []
for c, (recordid, values) in enumerate(generatevalues()):
    rows.append(prep_row(recordid, values))
    if c % 117 == 0:
        fd.write('\n'.join(rows))
        rows = []

My second thought would be to use another thread, but that makes me want to die inside.

chmullig
  • What is the bottleneck of the application? – Lasse V. Karlsen Feb 10 '11 at 19:26
  • I thought I was clear. It's spending 20% of its time writing to disk one row at a time. – chmullig Feb 10 '11 at 19:27
  • Well, make the change and profile it? I'd expect it to have little effect, as file I/O is usually line buffered... or so I assume. – Feb 10 '11 at 19:29
  • Threading isn't going to help. Why are you flushing so much? I mean, I know that LeBron James "never forgets to flush", but do you need to do so every time you write to stdout? – David Heffernan Feb 10 '11 at 19:31
  • @David, the flushing is for his stdout, not the file in question. – Winston Ewert Feb 10 '11 at 19:32
  • @Winston is right. I flush just for the stdout printing of status reports. That happens only every 117 lines, or about once every 20 seconds. – chmullig Feb 10 '11 at 19:38
  • This test case based on your code: http://pastebin.com/pyC8puAZ, runs in under a second. Whatever is causing slowness, it's not caused by what you've posted. – Winston Ewert Feb 10 '11 at 20:42
  • A gotcha to be aware of with your suggestion to batch: if you happen to have a single record in a batch it will not write a newline, so `"\n".join(["a"]) -> "a"`. If the next batch is also a single record you will write `"\n".join(["b"]) -> "b"`, and the file ends up containing "ab". – user1847 May 01 '21 at 17:20

3 Answers

36

Actually, your problem is not that file.write() takes 20% of your time. It's that 80% of the time you aren't in file.write()!

Writing to the disk is slow. It simply takes a very large amount of time to get data out to disk, and there is almost nothing you can do to speed that part up.

What you want is for that I/O time to be the biggest part of the program, so that your speed is limited by the speed of the hard disk rather than by your processing time. The ideal is for file.write() to have 100% usage!

Saullo G. P. Castro
Winston Ewert
  • +1, and normally I'd agree with you. However, in this case the external process I'm using to generate the data is super slow, and I want to minimize everything else as much as possible. I suppose I should talk in seconds, and focus on reducing that. Will edit to clarify # of seconds. – chmullig Feb 10 '11 at 19:44
  • Updated to specify that it's taking 20 seconds to write 1,500 records. I really care about that number, I was just using the percents to prove that it's actually worth optimizing. – chmullig Feb 10 '11 at 19:50
  • @chmullig, what % is being spent waiting for the external process? – Winston Ewert Feb 10 '11 at 20:17
  • ~53% is spent waiting for external processes. Another ~3% is spent in some header-type functions. 14% is spent in simplejson loads(), and 20% is spent in I/O. So there's only 10% that's my code. – chmullig Feb 10 '11 at 20:38
  • @chmullig, something isn't right here. There is no way that writing should be taking that long. See if a python program can write other random strings to the disk faster. – Winston Ewert Feb 10 '11 at 20:56
22

Batching the writes into groups of 500 did indeed speed up the writes significantly. For this test case, writing rows individually took 21.051 seconds in I/O, while writing in batches of 117 took 5.685 seconds to write the same number of rows. Batches of 500 took a total of only 0.266 seconds.
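
For reference, the batched version ends up looking roughly like the sketch below. This is not the exact production code: it reuses the question's generatevalues(), prep_row() and data_file names, assumes prep_row() returns each 7158-character line without a trailing newline (so the newline is added explicitly), and flushes whatever is left over after the loop ends.

BATCH_SIZE = 500
fd = open(data_file, 'w')
rows = []
for c, (recordid, values) in enumerate(generatevalues()):
    # assumption: prep_row() returns the line *without* a trailing newline
    rows.append(prep_row(recordid, values) + '\n')
    if len(rows) >= BATCH_SIZE:
        fd.write(''.join(rows))   # one write() call per batch
        rows = []
if rows:
    fd.write(''.join(rows))       # don't lose the final partial batch
fd.close()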

chmullig
  • And now you can add threading for the file writing to do asynchronous writing. – ted Aug 24 '12 at 08:53
  • Better yet - just output to stdout, pipe it to a file and let the OS do the buffering and batch write for you. :) – Lester Cheung Apr 03 '14 at 01:08
  • @LesterCheung, could you elaborate? – blindguy Nov 06 '15 at 18:16
  • @blindguy like "python yourprog.py | buffer > out" but I think batch output will still help here. – Lester Cheung Nov 09 '15 at 00:11
  • How does the performance compare to his approach? Is the OS buffering just as fast? – angryip Apr 28 '16 at 20:36
  • @ted Does the context switching between threads become negligible here? Should you have a thread checking a queue that will consume data? Or are you spawning a thread for `file.write` each time you want to batch write? Since the GIL is released during a file write, it makes sense to just spawn a thread and send data to it rather than using a consumer-producer setup. – pstatix Sep 23 '20 at 14:49
1

You can do mmap in Python, which might help. But I suspect you made a mistake while profiling, because 7k * 1500 in 20 seconds is only about 0.5 MB/s. Do a test in which you write random lines of the same length, and you will see it's much faster than that.
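
Such a test could be as simple as the sketch below. The file name write_test.dat, the way the random lines are built, and timing only the write loop are my own choices for illustration; the row length and count mirror the question's numbers.

import random
import string
import time

ROW_LEN = 7158
NUM_ROWS = 1500

# pre-generate random lines of the same length as the question's records,
# so that only the write loop itself is timed
alphabet = string.ascii_letters + ' '
lines = [''.join(random.choices(alphabet, k=ROW_LEN)) + '\n'
         for _ in range(NUM_ROWS)]

start = time.time()
with open('write_test.dat', 'w') as fd:
    for line in lines:
        fd.write(line)
elapsed = time.time() - start
print('wrote %d rows of %d chars in %.3f seconds' % (NUM_ROWS, ROW_LEN, elapsed))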

tsg