Is there a better implementation for:

file = open("./myFile.txt", "w+")
for doc in allDocuments:  # around 35 million documents
    file.write(str(doc.number) + "\r\n")
file.close()

Currently this implementation takes about 20 seconds per file.
The I/O isn't really the problem, nor can you do anything useful about it: the file object and the operating system will both be doing some form of buffering. What you can do something about is the number of method calls you make.
with open("./myFile.txt", "w", newline='\r\n') as f:
f.writelines(f'{doc.number}\n' for doc in allDocuments)
The writelines method takes an iterable of strings to write to the file. (The documentation says "list", but it seems to be speaking of lists loosely rather than the list type; a generator expression seems to work as well.) The generator expression produces each line on demand for writelines.
Here's a test:
import random

def allDocuments():
    for i in range(35_000_000):
        yield random.randint(0, 100)

with open("tmp.txt", "w", newline='\r\n') as f:
    f.writelines(f'{doc}\n' for doc in allDocuments())
It completed in 75 seconds (most of that due to the repeated calls to random.randint), using less than 6 MB of memory. (Replacing the call to random.randint with a constant value dropped the running time to under 30 seconds.)
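If you want to reproduce those measurements yourself, here is one way to instrument the run; this is my own sketch, not part of the test above, using time.perf_counter for wall-clock time and tracemalloc for peak Python-level allocation (tracemalloc will not see OS-level file buffers):

import random
import time
import tracemalloc

def allDocuments():
    for i in range(35_000_000):
        yield random.randint(0, 100)

# Track Python memory allocations and wall-clock time around the write loop.
tracemalloc.start()
start = time.perf_counter()

with open("tmp.txt", "w", newline='\r\n') as f:
    f.writelines(f'{doc}\n' for doc in allDocuments())

elapsed = time.perf_counter() - start
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

print(f"elapsed: {elapsed:.1f} s, peak traced memory: {peak / 1_000_000:.1f} MB")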
You could buffer the output and write in blocks:
with open("./myFile.txt", "w") as output:
lines = []
for doc in allDocuments:
lines.append(f"{doc.number}\r\n")
if len(lines) > 1000:
output.writelines(lines)
lines = []
output.writelines(lines)
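A closely related variant (a sketch I'm adding, assuming the same allDocuments iterable and doc.number attribute from the question) joins each block into a single string, so the file object sees one write call per block instead of one per line; the block size of 10,000 is an arbitrary choice:

from itertools import islice

CHUNK = 10_000  # arbitrary block size; tune as needed

with open("./myFile.txt", "w") as output:
    it = (f"{doc.number}\r\n" for doc in allDocuments)
    while True:
        block = list(islice(it, CHUNK))
        if not block:
            break
        # One write call per block rather than one per line.
        output.write("".join(block))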