Below is an extension to the elegant answer by @Tombart and a few further observations.
With one goal in mind: optimizing the process of generating data in loop(s) and then writing it to a file, let's begin:
I will use the with statement to open/close the file test.txt in all cases. The with statement automatically closes the file once the code block inside it has finished executing.
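For reference, a with block like the ones below behaves roughly like the following try/finally pattern (this is just an illustration; all the timed examples use with):
#rough equivalent of: with open('test.txt', 'w') as f: ...
f = open('test.txt', 'w')
try:
    f.write('some line\n')   #whatever work happens inside the with block
finally:
    f.close()                #always runs, even if the write raises an exception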
Another important point to consider is the way Python processes text files on different operating systems. From the docs:
Note: Python doesn’t depend on the underlying operating system’s notion of text files; all the processing is done by Python itself, and is therefore platform-independent.
This means that these results may vary only slightly when executed on Linux, Mac or Windows. The slight variation may result from other processes using the same file at the same time, multiple I/O operations happening on the file during script execution, general CPU processing speed, and so on.
I present three cases with execution times for each, and finally suggest a way to further optimize the fastest case:
First case: Loop over range(1,1000000) and write to file
import time
import random
start_time = time.time()
with open('test.txt' ,'w') as f:
    for seq_id in range(1,1000000):
        num_val = random.random()
        line = "%i %f\n" %(seq_id, num_val)
        f.write(line)
print('Execution time: %s seconds' % (time.time() - start_time))
#Execution time: 2.6448447704315186 seconds
Note: In the two list scenarios below, I have initialized an empty list data_lines like [] instead of using list(). The reason: [] is about 3 times faster than list(). Here's an explanation for this behavior: Why is [] faster than list()?. The main crux of the discussion is: while [] is created as a bytecode object in a single instruction, list() is a separate Python object that also needs name resolution, a global function call, and involvement of the stack to push arguments.
Using the timeit() function in the timeit module, here's the comparison:
import timeit

timeit.timeit("[]")
#0.030497061136874608
timeit.timeit("list()")
#0.12418613287039193
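If you want to see this difference for yourself, you can disassemble both expressions with the dis module (the exact opcode names vary a little between Python versions):
import dis

dis.dis("[]")      #typically a single BUILD_LIST instruction to create the list
dis.dis("list()")  #LOAD_NAME for `list` followed by a call instruction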
Second Case: Loop over range(1,1000000), append values to an empty list and then write to file
import time
import random
start_time = time.time()
data_lines = []
with open('test.txt' ,'w') as f:
    for seq_id in range(1,1000000):
        num_val = random.random()
        line = "%i %f\n" %(seq_id, num_val)
        data_lines.append(line)
    for line in data_lines:
        f.write(line)
print('Execution time: %s seconds' % (time.time() - start_time))
#Execution time: 2.6988046169281006 seconds
Third Case: Loop over a list comprehension and write to file
With Python's powerful and compact list comprehensions, it is possible to optimize the process further:
import time
import random
start_time = time.time()
with open('test.txt' ,'w') as f:
    data_lines = ["%i %f\n" %(seq_id, random.random()) for seq_id in range(1,1000000)]
    for line in data_lines:
        f.write(line)
print('Execution time: %s seconds' % (time.time() - start_time))
#Execution time: 2.464804172515869 seconds
On multiple iterations, I've always received a lower execution time value in this case as compared to the previous two cases.
#Iteration 2: Execution time: 2.496004581451416 seconds
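As a side note (not one of the three measured cases above), the explicit write loop in the third case can also be replaced by a single file.writelines() call, since the lines already end with \n. I haven't benchmarked this variant here, so treat any speedup as machine-dependent:
import time
import random

start_time = time.time()

with open('test.txt' ,'w') as f:
    data_lines = ["%i %f\n" %(seq_id, random.random()) for seq_id in range(1,1000000)]
    f.writelines(data_lines)   #writes every string in the list; does not add newlines itself

print('Execution time: %s seconds' % (time.time() - start_time))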
Now the question arises: why are list comprehensions (and lists in general) faster than sequential for loops?
An interesting way to analyze what happens when a sequential for loop executes versus when a list comprehension executes is to disassemble the code object generated by each and examine the contents. Here is an example of a list comprehension code object disassembled:
#disassemble a list comprehension code object
import dis
l = "[x for x in range(10)]"
code_obj = compile(l, '<list>', 'exec')
print(code_obj) #<code object <module> at 0x000000058DA45030, file "<list>", line 1>
dis.dis(code_obj)
#Output:
<code object <module> at 0x000000058D5D4C90, file "<list>", line 1>
1 0 LOAD_CONST 0 (<code object <listcomp> at 0x000000058D5D4ED0, file "<list>", line 1>)
2 LOAD_CONST 1 ('<listcomp>')
4 MAKE_FUNCTION 0
6 LOAD_NAME 0 (range)
8 LOAD_CONST 2 (10)
10 CALL_FUNCTION 1
12 GET_ITER
14 CALL_FUNCTION 1
16 POP_TOP
18 LOAD_CONST 3 (None)
20 RETURN_VALUE
Here's an example of a for loop code object, disassembled from a function test:
#disassemble a function code object containing a `for` loop
import dis
test_list = []
def test():
    for x in range(1,10):
        test_list.append(x)
code_obj = test.__code__ #get the code object <code object test at 0x000000058DA45420, file "<ipython-input-19-55b41d63256f>", line 4>
dis.dis(code_obj)
#Output:
0 SETUP_LOOP 28 (to 30)
2 LOAD_GLOBAL 0 (range)
4 LOAD_CONST 1 (1)
6 LOAD_CONST 2 (10)
8 CALL_FUNCTION 2
10 GET_ITER
>> 12 FOR_ITER 14 (to 28)
14 STORE_FAST 0 (x)
6 16 LOAD_GLOBAL 1 (test_list)
18 LOAD_ATTR 2 (append)
20 LOAD_FAST 0 (x)
22 CALL_FUNCTION 1
24 POP_TOP
26 JUMP_ABSOLUTE 12
>> 28 POP_BLOCK
>> 30 LOAD_CONST 0 (None)
32 RETURN_VALUE
The above comparison shows more "activity", if I may, in the case of the for loop. For instance, notice the extra work for the append() method call in the for loop version: a LOAD_ATTR followed by a CALL_FUNCTION on every iteration. To learn more about the instructions in the dis output, here's the official documentation.
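If you want to dig one level deeper, you can also disassemble the nested <listcomp> code object from the first example (it is the constant loaded by LOAD_CONST 0 above). There you should see a single LIST_APPEND instruction where the for loop version needs a LOAD_ATTR/CALL_FUNCTION pair for every append(). Note that this nested code object exists on the Python 3.x versions used here; newer releases (3.12+) inline comprehensions, so the details differ:
#disassemble the nested <listcomp> code object
import dis

l = "[x for x in range(10)]"
code_obj = compile(l, '<list>', 'exec')

listcomp_obj = code_obj.co_consts[0]   #the <listcomp> code object shown in the LOAD_CONST 0 line above
dis.dis(listcomp_obj)
#look for LIST_APPEND in the output instead of LOAD_ATTR (append) + CALL_FUNCTION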
Finally, as suggested before, I also tested with file.flush() and the execution time was in excess of 11 seconds. Here I add f.flush() and os.fsync(f.fileno()) before the f.write() statement:
import os
.
.
.
for line in data_lines:
    f.flush()                #flushes Python's internal buffer and copies the data to the OS buffer
    os.fsync(f.fileno())     #asks the OS to write the data behind the file descriptor (fd=f.fileno()) to disk
    f.write(line)
The longer execution time using flush() can be attributed to the way the data is processed. flush() copies the data from the program's buffer to the operating system's buffer. This means that if a file (say test.txt in this case) is being used by multiple processes and large chunks of data are being added to it, you will not have to wait for all of the data to be written before the information becomes available to the other processes. But to make sure that the buffered data is actually written to disk, you also need to add os.fsync(f.fileno()). Adding os.fsync() increases the execution time at least 10 times (I didn't sit through the whole run!) as it involves copying data from the buffers to the physical disk. For more details, go here.
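If you only need the data to be safely on disk by the time the file is closed, rather than after every single line, a middle ground (just a sketch, not timed above) is to flush and fsync once after the loop, which avoids paying the per-line cost:
import os
import time
import random

start_time = time.time()

with open('test.txt' ,'w') as f:
    data_lines = ["%i %f\n" %(seq_id, random.random()) for seq_id in range(1,1000000)]
    for line in data_lines:
        f.write(line)
    f.flush()               #copy Python's buffer to the OS buffer once, after all writes
    os.fsync(f.fileno())    #ask the OS to commit the file contents to disk once

print('Execution time: %s seconds' % (time.time() - start_time))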
Further Optimization: It is possible to optimize the process even further. There are libraries available that support multithreading, create Process Pools and perform asynchronous tasks. This is particularly useful when a function performs a CPU-intensive task and writes to a file at the same time. For instance, a combination of threading and list comprehensions gives the fastest result(s) among the cases here:
import time
import random
import threading
start_time = time.time()
def get_seq():
    data_lines = ["%i %f\n" %(seq_id, random.random()) for seq_id in range(1,1000000)]
    with open('test.txt' ,'w') as f:
        for line in data_lines:
            f.write(line)
set_thread = threading.Thread(target=get_seq)
set_thread.start()
print('Execution time: %s seconds' % (time.time() - start_time))
#Execution time: 0.015599966049194336 seconds
Note that this measurement mostly reflects the cost of starting the thread: the main thread prints the elapsed time immediately, while the formatting and writing continue in the background thread. It is still useful when you don't want the rest of your program to block on the file write.
Conclusion: List comprehensions offer better performance than sequential for loops and list appends. The primary reason is the single-instruction bytecode used to append items inside a list comprehension, which is faster than the repeated method-call sequence used to append items to a list in a for loop. There is scope for further optimization using asyncio, threading and ProcessPoolExecutor(); you could also use a combination of these to achieve faster results. Whether to use file.flush() depends upon your requirements. You may add it when you need another process to see the data while the file is still being written. However, the whole process can take much longer if you are also forcing the data from the OS's buffers onto the physical disk using os.fsync(f.fileno()).
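For completeness, here is a minimal sketch of the ProcessPoolExecutor() approach mentioned above. The build_chunk() helper, the chunk boundaries and the chunk size of 250,000 are my own illustrative choices, and whether this actually beats the single-process list comprehension depends entirely on your machine:
import time
import random
from concurrent.futures import ProcessPoolExecutor

def build_chunk(bounds):
    #format one chunk of the sequence in a worker process
    start, end = bounds
    return "".join("%i %f\n" %(seq_id, random.random()) for seq_id in range(start, end))

if __name__ == '__main__':
    start_time = time.time()
    chunks = [(i, min(i + 250000, 1000000)) for i in range(1, 1000000, 250000)]
    with ProcessPoolExecutor() as executor:
        with open('test.txt' ,'w') as f:
            for block in executor.map(build_chunk, chunks):   #results come back in order
                f.write(block)
    print('Execution time: %s seconds' % (time.time() - start_time))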