Below is an extension to the elegant answer by @Tombart and a few further observations.
With one goal in mind: optimizing the process of generating data in loop(s) and then writing it to a file, let's begin:
I will use the with statement to open/close the file test.txt in all cases. The with statement automatically closes the file once the code block inside it has finished executing.
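For reference, a with block like the ones below behaves roughly like the following try/finally pattern (this is just an illustration; all the timed examples use with):
#rough equivalent of: with open('test.txt', 'w') as f: ...
f = open('test.txt', 'w')
try:
    f.write('some line\n')   #whatever work happens inside the with block
finally:
    f.close()                #always runs, even if the write raises an exception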
Another important point to consider is the way Python processes text files on different operating systems. From the docs:
Note: Python doesn’t depend on the underlying operating system’s notion of text files; all the processing is done by Python itself, and is therefore platform-independent.
This means that these results may vary only slightly when executed on Linux, Mac or Windows. The slight variation may result from other processes using the same file at the same time, multiple I/O operations happening on the file during script execution, general CPU processing speed, and so on.
I present three cases with execution times for each, and finally suggest a way to further optimize the fastest case:
First case: Loop over range(1,1000000) and write to file
import time
import random
start_time = time.time()
with open('test.txt' ,'w') as f:
    for seq_id in range(1,1000000):
        num_val = random.random()
        line = "%i %f\n" %(seq_id, num_val)
        f.write(line)
print('Execution time: %s seconds' % (time.time() - start_time))
#Execution time: 2.6448447704315186 seconds
Note: In the two list scenarios below, I have initialized an empty list data_lines like [] instead of using list(). The reason: [] is about 3 times faster than list(). Here's an explanation for this behavior: Why is [] faster than list()?. The main crux of the discussion is: while [] is created as a bytecode object in a single instruction, list() is a separate Python object that also needs name resolution, a global function call, and involvement of the stack to push arguments.
Using the timeit() function in the timeit module, here's the comparison:
import timeit

timeit.timeit("[]")
#0.030497061136874608
timeit.timeit("list()")
#0.12418613287039193
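If you want to see this difference for yourself, you can disassemble both expressions with the dis module (the exact opcode names vary a little between Python versions):
import dis

dis.dis("[]")      #typically a single BUILD_LIST instruction to create the list
dis.dis("list()")  #LOAD_NAME for `list` followed by a call instruction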
Second Case: Loop over range(1,1000000), append values to an empty list and then write to file
import time
import random
start_time = time.time()
data_lines = []
with open('test.txt' ,'w') as f:
    for seq_id in range(1,1000000):
        num_val = random.random()
        line = "%i %f\n" %(seq_id, num_val)
        data_lines.append(line)
    for line in data_lines:
        f.write(line)
print('Execution time: %s seconds' % (time.time() - start_time))
#Execution time: 2.6988046169281006 seconds
Third Case: Loop over a list comprehension and write to file
With Python's powerful and compact list comprehensions, it is possible to optimize the process further:
import time
import random
start_time = time.time()
with open('test.txt' ,'w') as f:
    data_lines = ["%i %f\n" %(seq_id, random.random()) for seq_id in range(1,1000000)]
    for line in data_lines:
        f.write(line)
print('Execution time: %s seconds' % (time.time() - start_time))
#Execution time: 2.464804172515869 seconds
On multiple iterations, I've always received a lower execution time value in this case as compared to the previous two cases.
#Iteration 2: Execution time: 2.496004581451416 seconds
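As a side note (not one of the three measured cases above), the explicit write loop in the third case can also be replaced by a single file.writelines() call, since the lines already end with \n. I haven't benchmarked this variant here, so treat any speedup as machine-dependent:
import time
import random

start_time = time.time()

with open('test.txt' ,'w') as f:
    data_lines = ["%i %f\n" %(seq_id, random.random()) for seq_id in range(1,1000000)]
    f.writelines(data_lines)   #writes every string in the list; does not add newlines itself

print('Execution time: %s seconds' % (time.time() - start_time))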
Now the question arises: why are list comprehensions (and lists in general) faster than sequential for loops?
An interesting way to analyze what happens when a sequential for loop executes versus when a list comprehension executes is to disassemble the code object generated by each and examine the contents. Here is an example of a list comprehension code object disassembled:
#disassemble a list comprehension code object
import dis
l = "[x for x in range(10)]"
code_obj = compile(l, '<list>', 'exec')
print(code_obj) #<code object <module> at 0x000000058DA45030, file "<list>", line 1>
dis.dis(code_obj)
#Output:
<code object <module> at 0x000000058D5D4C90, file "<list>", line 1>
1 0 LOAD_CONST 0 (<code object <listcomp> at 0x000000058D5D4ED0, file "<list>", line 1>)
2 LOAD_CONST 1 ('<listcomp>')
4 MAKE_FUNCTION 0
6 LOAD_NAME 0 (range)
8 LOAD_CONST 2 (10)
10 CALL_FUNCTION 1
12 GET_ITER
14 CALL_FUNCTION 1
16 POP_TOP
18 LOAD_CONST 3 (None)
20 RETURN_VALUE
Here's an example of a for loop code object, disassembled from a function test:
#disassemble a function code object containing a `for` loop
import dis
test_list = []
def test():
    for x in range(1,10):
        test_list.append(x)
code_obj = test.__code__ #get the code object <code object test at 0x000000058DA45420, file "<ipython-input-19-55b41d63256f>", line 4>
dis.dis(code_obj)
#Output:
0 SETUP_LOOP 28 (to 30)
2 LOAD_GLOBAL 0 (range)
4 LOAD_CONST 1 (1)
6 LOAD_CONST 2 (10)
8 CALL_FUNCTION 2
10 GET_ITER
>> 12 FOR_ITER 14 (to 28)
14 STORE_FAST 0 (x)
6 16 LOAD_GLOBAL 1 (test_list)
18 LOAD_ATTR 2 (append)
20 LOAD_FAST 0 (x)
22 CALL_FUNCTION 1
24 POP_TOP
26 JUMP_ABSOLUTE 12
>> 28 POP_BLOCK
>> 30 LOAD_CONST 0 (None)
32 RETURN_VALUE
The above comparison shows more "activity", if I may, in the case of the for loop. For instance, notice the extra work for the append() method call in the for loop version: a LOAD_ATTR followed by a CALL_FUNCTION on every iteration. To learn more about the instructions in the dis output, here's the official documentation.
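If you want to dig one level deeper, you can also disassemble the nested <listcomp> code object from the first example (it is the constant loaded by LOAD_CONST 0 above). There you should see a single LIST_APPEND instruction where the for loop version needs a LOAD_ATTR/CALL_FUNCTION pair for every append(). Note that this nested code object exists on the Python 3.x versions used here; newer releases (3.12+) inline comprehensions, so the details differ:
#disassemble the nested <listcomp> code object
import dis

l = "[x for x in range(10)]"
code_obj = compile(l, '<list>', 'exec')

listcomp_obj = code_obj.co_consts[0]   #the <listcomp> code object shown in the LOAD_CONST 0 line above
dis.dis(listcomp_obj)
#look for LIST_APPEND in the output instead of LOAD_ATTR (append) + CALL_FUNCTION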
Finally, as suggested before, I also tested with file.flush() and the execution time was in excess of 11 seconds. Here I add f.flush() and os.fsync(f.fileno()) before the f.write() statement:
import os
.
.
.
for line in data_lines:
    f.flush()                #flushes Python's internal buffer and copies the data to the OS buffer
    os.fsync(f.fileno())     #asks the OS to write the data behind the file descriptor (fd=f.fileno()) to disk
    f.write(line)
The longer execution time using flush() can be attributed to the way the data is processed. flush() copies the data from the program's buffer to the operating system's buffer. This means that if a file (say test.txt in this case) is being used by multiple processes and large chunks of data are being added to it, you will not have to wait for all of the data to be written before the information becomes available to the other processes. But to make sure that the buffered data is actually written to disk, you also need to add os.fsync(f.fileno()). Adding os.fsync() increases the execution time at least 10 times (I didn't sit through the whole run!) as it involves copying data from the buffers to the physical disk. For more details, go here.
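If you only need the data to be safely on disk by the time the file is closed, rather than after every single line, a middle ground (just a sketch, not timed above) is to flush and fsync once after the loop, which avoids paying the per-line cost:
import os
import time
import random

start_time = time.time()

with open('test.txt' ,'w') as f:
    data_lines = ["%i %f\n" %(seq_id, random.random()) for seq_id in range(1,1000000)]
    for line in data_lines:
        f.write(line)
    f.flush()               #copy Python's buffer to the OS buffer once, after all writes
    os.fsync(f.fileno())    #ask the OS to commit the file contents to disk once

print('Execution time: %s seconds' % (time.time() - start_time))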
Further Optimization: It is possible to optimize the process even further. There are libraries available that support multithreading, create Process Pools and perform asynchronous tasks. This is particularly useful when a function performs a CPU-intensive task and writes to a file at the same time. For instance, a combination of threading and list comprehensions gives the fastest result(s) among the cases here:
import time
import random
import threading
start_time = time.time()
def get_seq():
    data_lines = ["%i %f\n" %(seq_id, random.random()) for seq_id in range(1,1000000)]
    with open('test.txt' ,'w') as f:
        for line in data_lines:
            f.write(line)
set_thread = threading.Thread(target=get_seq)
set_thread.start()
print('Execution time: %s seconds' % (time.time() - start_time))
#Execution time: 0.015599966049194336 seconds
Note that this measurement mostly reflects the cost of starting the thread: the main thread prints the elapsed time immediately, while the formatting and writing continue in the background thread. It is still useful when you don't want the rest of your program to block on the file write.
Conclusion: List comprehensions offer better performance than sequential for loops and list appends. The primary reason is the single-instruction bytecode used to append items inside a list comprehension, which is faster than the repeated method-call sequence used to append items to a list in a for loop. There is scope for further optimization using asyncio, threading and ProcessPoolExecutor(); you could also use a combination of these to achieve faster results. Whether to use file.flush() depends upon your requirements. You may add it when you need another process to see the data while the file is still being written. However, the whole process can take much longer if you are also forcing the data from the OS's buffers onto the physical disk using os.fsync(f.fileno()).
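For completeness, here is a minimal sketch of the ProcessPoolExecutor() approach mentioned above. The build_chunk() helper, the chunk boundaries and the chunk size of 250,000 are my own illustrative choices, and whether this actually beats the single-process list comprehension depends entirely on your machine:
import time
import random
from concurrent.futures import ProcessPoolExecutor

def build_chunk(bounds):
    #format one chunk of the sequence in a worker process
    start, end = bounds
    return "".join("%i %f\n" %(seq_id, random.random()) for seq_id in range(start, end))

if __name__ == '__main__':
    start_time = time.time()
    chunks = [(i, min(i + 250000, 1000000)) for i in range(1, 1000000, 250000)]
    with ProcessPoolExecutor() as executor:
        with open('test.txt' ,'w') as f:
            for block in executor.map(build_chunk, chunks):   #results come back in order
                f.write(block)
    print('Execution time: %s seconds' % (time.time() - start_time))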