2

Is there a better implementation for:

file = open("./myFile.txt", "w+")
for doc in allDocuments:  # around 35 million documents
    file.write(str(doc.number) + "\r\n")
file.close()

Currently this implementation takes about 20 seconds per file.

Leonardo
  • You are calling `.write` millions of times. It would be more efficient to call it fewer times, ideally once. – jkr Feb 19 '20 at 19:28
  • Using `w+` is an error for text files. – Daniel Feb 19 '20 at 19:28
  • @Daniel Why is that? Can you please expand? – Leonardo Feb 19 '20 at 19:29
  • The `+` is for reading and writing with random access, which is not practical for text files. – Daniel Feb 19 '20 at 19:31
  • @Leonardo `w+` opens a file for reading _and writing_, whereas `w` opens only for writing. – jkr Feb 19 '20 at 19:31
  • I thought `w+` was write and overwrite – Leonardo Feb 19 '20 at 19:34
  • `w+` lets you read and write from the same handle. You can always use `seek` to change where the next write occurs. *All* writes overwrite whatever previous content was at the current position; you can't just insert new bytes into the middle of a file. – chepner Feb 19 '20 at 19:40
  • How much is there to write in total? Is simply creating the entire string first a viable option? – AMC Feb 19 '20 at 19:48
  • @chepner - But it's only safe to `seek(0)` because of the difference in size between unencoded bytes and encoded characters (and the buffering Python does for the encoding). Maybe we should say it's _usually_ an error to open a text file in `w+`. In OP's case, the file isn't seeked or read, so the rule of least privilege applies. – tdelaney Feb 19 '20 at 20:04
  • That's what I (very indirectly) alluded to with "can't just insert new bytes into the middle", but that ignores the possibility not just of overwriting more than you meant but of starting a write in the middle of a (e.g.) UTF-8 byte. – chepner Feb 19 '20 at 20:06
  • @Leonardo - `w` always truncates the file. Add a `+` to also make it readable. The `+` is only useful in the rare case that you want to rewind the file and read it from the beginning (see the sketch below). – tdelaney Feb 19 '20 at 20:13
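
To make the comments above concrete, here is a minimal sketch of the `w`/`w+` behavior (the file name `demo.txt` is a throwaway example):

# `w` truncates and is write-only; `w+` truncates but also allows reading
# back after a seek. Writes never insert; they overwrite in place.
with open("demo.txt", "w+") as f:
    f.write("hello world\n")
    f.seek(0)              # rewind; seek(0) is always safe in text mode
    print(f.read())        # reading works only because of the '+'
    f.seek(0)
    f.write("HELLO")       # overwrites the first five characters in place
    f.seek(0)
    print(f.read())        # prints "HELLO world"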

2 Answers

3

The I/O isn't really the problem, nor can you do anything useful about it: the file object and the operating system will both be doing some form of buffering. What you can do something about is the number of method calls you make.

with open("./myFile.txt", "w", newline='\r\n') as f:
    f.writelines(f'{doc.number}\n' for doc in allDocuments)

The writelines method takes an iterable of strings to write to the file. (The documentation says "list", but it accepts any iterable of strings, not just lists; a generator expression works as well.) Note that writelines does not add line separators, which is why each generated string ends in '\n'. The generator expression produces each line on demand for writelines, so the whole output never has to exist in memory at once.


Here's a test:

import random

def allDocuments():
    # stand-in for the real 35 million documents
    for i in range(35_000_000):
        yield random.randint(0, 100)

with open("tmp.txt", "w", newline='\r\n') as f:
    f.writelines(f'{doc}\n' for doc in allDocuments())

It completed in 75 seconds (most of that due to the repeated calls to random.randint), using less than 6 MB of memory. (Replacing the call to random.randint with a constant value dropped the running time to under 30 seconds.)
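
If you want to sanity-check the memory figure yourself, one option (a sketch, not necessarily how it was measured above) is `tracemalloc`, which reports peak Python-level allocations; note it does not see the file object's C-level buffer:

import random
import tracemalloc

def allDocuments():
    # smaller run than the test above, enough to see the memory profile
    for i in range(1_000_000):
        yield random.randint(0, 100)

tracemalloc.start()
with open("tmp.txt", "w", newline='\r\n') as f:
    f.writelines(f'{doc}\n' for doc in allDocuments())
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()
print(f"peak Python allocations: {peak / 1024:.1f} KiB")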

chepner
  • +1. The use of a generator in `file.writelines` has been proposed on Stack Overflow in the past by the mighty Martijn Pieters: https://stackoverflow.com/a/44560826/5666087 – jkr Feb 20 '20 at 02:30
1

You could buffer the output and write in blocks:

with open("./myFile.txt", "w") as output:
    lines = []
    for doc in allDocuments:
        lines.append(f"{doc.number}\r\n")
        if len(lines) > 1000:  # flush in blocks of roughly 1000 lines
            output.writelines(lines)
            lines = []
    output.writelines(lines)  # write whatever is left over

Daniel
  • Great answer. I would also experiment with different block sizes: instead of 1,000, try 10,000, 100,000, etc., and see which works best for you. :) – sniperd Feb 19 '20 at 19:36
  • So, Python's `open` returns a buffered file object already. Not sure this is necessary. – juanpa.arrivillaga Feb 19 '20 at 19:38
  • Since files are already buffered, it's not the case that every call to `write` actually incurs disk I/O. All this does is trade a call to `output.write` for a call to `lines.append`. – chepner Feb 19 '20 at 19:44
  • So, it's probably easier to just modify the `buffering` parameter, which defaults to line buffering for text files; that doesn't seem ideal in this particular case (see the sketch after these comments). – juanpa.arrivillaga Feb 19 '20 at 19:45
  • I agree with @chepner. And with that, I believe his answer (https://stackoverflow.com/a/60308140/5666087) is superior. – jkr Feb 19 '20 at 19:58
  • @chepner if you can't tell the difference between `output.write` and `lines.append`, I'd suggest you replicate the original question and test it for yourself. I consider this to be a better approach than yours below because it's a LOT more memory-friendly. Your suggestion is so memory-unfriendly that I can't even run it on my default-size container (1 core, 512 MB RAM). – Leonardo Feb 19 '20 at 21:30
  • I may have been mistaken in my belief that `writelines` would write each line as it reads it from the generator; I will check that. However, I never said that `output.write` and `lines.append` were the *same*, just that each call to `output.write` you eliminate is replaced with another function call. The total number of calls hasn't decreased. – chepner Feb 19 '20 at 21:37
  • I wrote a generator `allDocuments` that simply yielded the number `5` 350,000,000 times, then called `f.writelines(f'{doc}\r\n' for doc in allDocuments())`. Its memory usage is constant, at around 5 MB. You didn't use a list comprehension instead of a generator expression, did you? It also completed in under 3 minutes. – chepner Feb 19 '20 at 21:48
  • @Leonardo @chepner When using `open()` with a text mode, it returns a [`TextIOWrapper`](https://github.com/python/cpython/blob/44c690112d96a81fe02433de7900a4f8f9457012/Lib/_pyio.py#L1966) object, a child of [`TextIOBase`](https://github.com/python/cpython/blob/44c690112d96a81fe02433de7900a4f8f9457012/Lib/_pyio.py#L1814), itself a child of [`IOBase`](https://github.com/python/cpython/blob/44c690112d96a81fe02433de7900a4f8f9457012/Lib/_pyio.py#L313) where [`writelines()`](https://github.com/python/cpython/blob/44c690112d96a81fe02433de7900a4f8f9457012/Lib/_pyio.py#L593) is defined. – AMC Feb 20 '20 at 17:28
  • Since neither of those two descendants appear to override the method, the definition in `IOBase` is used, which is simply `self._checkClosed(); for line in lines: self.write(line)`. (I've spent very little time looking at Python's inner workings, so I hope I was in the right place) – AMC Feb 20 '20 at 17:28
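
To make juanpa.arrivillaga's buffering suggestion above concrete, here is a sketch: `open`'s `buffering` argument (any value greater than 1) sets the size in bytes of the underlying binary buffer, so you can enlarge it instead of batching lines by hand. The 1 MiB figure is an arbitrary example, and `allDocuments` is the iterable from the question:

# The loop is unchanged; the larger buffer just means fewer real disk writes.
# 1 MiB is an arbitrary example size, not a value suggested in the thread.
with open("./myFile.txt", "w", buffering=1024 * 1024) as output:
    for doc in allDocuments:
        output.write(f"{doc.number}\r\n")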