25

I am trying to generate random real numbers, integers, alphanumeric strings, and alphabetic strings, and write them to a file until the file size reaches 10 MB.

The code is as follows.

import string
import random
import time
import sys


class Generator():
    def __init__(self):
        self.generate_alphabetical_strings()
        self.generate_integers()
        self.generate_alphanumeric()
        self.generate_real_numbers()

    def generate_alphabetical_strings(self):
        return ''.join(random.choice(string.ascii_lowercase) for i in range(12))

    def generate_integers(self):
        return ''.join(random.choice(string.digits) for i in range(12))

    def generate_alphanumeric(self):
        return ''.join(random.choice(self.generate_alphabetical_strings() +
                                     self.generate_integers()) for i in range(12))

    def _insert_dot(self, string, index):
        return string[:index].__add__('.').__add__(string[index:])


    def generate_real_numbers(self):
        rand_int_string = ''.join(random.choice(self.generate_integers()) for i in range(12))
        return self._insert_dot(rand_int_string, random.randint(0, 11))


from time import process_time

a = Generator()

t = process_time()
inp = open("test.txt", "w")
lt = 10 * 1000 * 1000
count = 0
while count <= lt:
    inp.write(a.generate_alphanumeric())
    count += 39
inp.close()

elapsed_time = process_time() - t
print(elapsed_time)

It takes around 225.953125 seconds to complete. How can I improve the speed of this program? Could you provide some insights into the code?

ajknzhol

3 Answers

49

Two major reasons for the observed "slowness":

  • Your while loop is slow: it runs roughly 256,000 iterations (10,000,000 / 39).
  • You do not make proper use of I/O buffering. Do not make so many system calls: currently, you are calling write() roughly 256,000 times.

Create your data in a Python data structure first and call write() only once.

This is faster:

t0 = time.time()
open("bla.txt", "w").write(''.join(random.choice(string.ascii_lowercase) for i in range(10**7)))
d = time.time() - t0
print("duration: %.2f s." % d)

Output: duration: 7.30 s.

Now the program spends most of its time generating the data, i.e. in the random machinery. You can easily see that by replacing random.choice(string.ascii_lowercase) with e.g. "a": the measured time then drops to below one second on my machine.
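
For instance, a minimal side-by-side timing of that experiment might look like the sketch below; the file name is only illustrative, and absolute numbers will of course differ from machine to machine:

import random
import string
import time

# One big write of 10^7 randomly chosen characters.
t0 = time.time()
open("bla.txt", "w").write(''.join(random.choice(string.ascii_lowercase) for i in range(10**7)))
print("random.choice: %.2f s." % (time.time() - t0))

# The same with a constant character: what remains is (mostly) string
# building and I/O, so the difference between the two timings is the
# cost of calling random.choice() ten million times.
t0 = time.time()
open("bla.txt", "w").write(''.join("a" for i in range(10**7)))
print("constant char: %.2f s." % (time.time() - t0))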

And if you want to get even closer to seeing how fast your machine really is when writing to disk, use Python's fastest (?) way to generate largish data before writing it to disk:

>>> t0 = time.time(); chunk = "a" * 10**7; n = open("bla.txt", "w").write(chunk); d = time.time() - t0; print("duration: %.2f s." % d)
duration: 0.02 s.
Dr. Jan-Philip Gehrcke
  • What do you mean by proper use of I/O buffering? – ajknzhol Dec 09 '14 at 16:58
  • You are writing to disk. Writing to disk is a complex physical and logical process. It involves a lot of mechanics and control. It is *much* faster to tell the disk "Here, this is 10 MB of data, write it!" than telling it millions of times "Here, this is 1 byte of data, write it!". Therefore, the operating system has a mechanism to "collect" data that a process wants to write to disk before actually saving it to disk. However, if you explicitly tell the operating system to write small portions, then it does it. You are doing so, and this is slow. See my edit. – Dr. Jan-Philip Gehrcke Dec 09 '14 at 17:02
  • @Jan-PhilipGehrcke: Is there a way to create a buffered file writer? – Aaron Digulla Dec 10 '14 at 13:02
  • @AaronDigulla: if you do not specify the `buffering` parameter upon calling Python's `open()`, a (small) buffer is usually applied, according to the "system default". The size of this buffer is not documented. For some versions of glibc, someone determined it to be 8 kB: http://stackoverflow.com/a/18194856/145400. For certain applications it makes sense to increase that buffer size with the `buffering` parameter. No general statement is possible, but benchmarks help. Sometimes it makes sense to explicitly collect data in memory first, via https://docs.python.org/2/library/stringio.html (both options are sketched below these comments). – Dr. Jan-Philip Gehrcke Dec 10 '14 at 16:14
  • I agree with the point in this answer. One important thing to note here is the `buffering` parameter. For a smaller data set (hundreds or thousands of items, with total data in the KB range) there is no significant difference in performance either way. In my analysis, the time taken by calling write() for each iteration came out to be the same (measured to millisecond precision) as the time taken by a single write() call. I had buffering set to -1 (which is the default OS block size). – Rishi Sep 14 '16 at 21:19
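
A small sketch of the two buffering options discussed in these comments; the file name, payload, and buffer size are only illustrative, and io.StringIO is the Python 3 counterpart of the StringIO module linked above:

import io

# Option 1: ask open() for a larger buffer, so many small write()
# calls are collected before data is handed to the operating system.
with open("test.txt", "w", buffering=1024 * 1024) as f:
    for i in range(100000):
        f.write("0123456789ab")

# Option 2: collect the data in memory explicitly, then write once.
buf = io.StringIO()
for i in range(100000):
    buf.write("0123456789ab")
with open("test.txt", "w") as f:
    f.write(buf.getvalue())
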
2

You literally create millions of objects which you then quickly throw away. In this case, it's probably better to write the strings directly into the file instead of concatenating them with ''.join().
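
A minimal sketch of that idea, assuming the Generator class from the question (the file object's own buffer then does the collecting, instead of intermediate joined strings):

gen = Generator()
size = 0
with open("test.txt", "w") as f:
    while size <= 10 * 1000 * 1000:
        s = gen.generate_alphanumeric()
        f.write(s)      # hand each piece straight to the buffered file
        size += len(s)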

Aaron Digulla
1

The while loop under main calls generate_alphanumeric, which, for each of its twelve output characters, builds a fresh 24-character string (twelve random ASCII letters followed by twelve random digits) and picks one character out of it. That's basically the same as choosing either a random letter or a random digit, twelve times. That's your main bottleneck. This version will make your code one order of magnitude faster:

def generate_alphanumeric(self):
    # For each of the twelve characters, flip a coin: letter or digit.
    res = ''
    for i in range(12):
        if random.randrange(2):
            res += random.choice(string.ascii_lowercase)
        else:
            res += random.choice(string.digits)
    return res

I'm sure it can be improved upon. I suggest you take your profiler for a spin.
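
For example, a quick way to take cProfile (from the standard library) for a spin; the workload() helper here is just scaffolding for this sketch:

import cProfile

def workload():
    gen = Generator()
    for i in range(10000):
        gen.generate_alphanumeric()

# Sort by cumulative time to see which methods dominate.
cProfile.run("workload()", sort="cumulative")

As one possible further improvement (Python 3.6+), the twelve coin flips can be collapsed into a single call: res = ''.join(random.choices(string.ascii_lowercase + string.digits, k=12)).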

debiatan
  • No, this is not the main bottleneck. I agree that his way of generating the data is not optimal, but no no no, this is not the bottleneck of this program. His bottleneck is inefficient I/O. – Dr. Jan-Philip Gehrcke Dec 09 '14 at 17:04
  • The original run time (on my machine) is 0m28.587s. My version takes 0m2.266s. Which other change would you make that had a greater impact? – debiatan Dec 09 '14 at 17:05
  • Remove the while loop, invoke `write()` only once. – Dr. Jan-Philip Gehrcke Dec 09 '14 at 17:09
  • The majority of the time is already spent in unnecessary random generation. If I'm taking away 92% of the original run-time, how is that not the bottleneck? Once you solve that problem, I'm sure your suggestion will come in handy. – debiatan Dec 09 '14 at 17:15
  • Okay, let us agree on this: his data generation is really inefficient, and his I/O code is really inefficient. Which one of both is the major bottleneck depends on the system (on my SSD-powered system with a good CPU it is the I/O). – Dr. Jan-Philip Gehrcke Dec 09 '14 at 17:18
  • My benchmark has been run on an SSD system, too. And I still don't see how that reduces the cost of an O(n^2) routine down to O(n). – debiatan Dec 09 '14 at 17:23
  • Sorry, I just realized that I had been considering his `generate_alphabetical_strings()` method all along, which is not as bad (see my answer above). Indeed, when he uses `generate_alphanumeric()`, this is his main bottleneck. – Dr. Jan-Philip Gehrcke Dec 09 '14 at 17:34