
The use case for this would be creating multiple generators based on some file-object without any of them trampling each other's read state.

Originally I (thought I) had a working implementation using seek() and tell(), where each generator was decorated by a meta-generator which maintained the file-handle position. This worked fine on things like StringIO, but failed on real files because the read-ahead buffer used during iteration throws off the offset that tell() reports.

Using readline() or otherwise mocking the real file-object isn't viable: the files are excessively large, which is what prompted a generator expression in the first place, so losing the read-ahead buffer isn't really a good option. (As an aside, why was Python implemented this way? Shouldn't the buffer act like a cache and not be exposed to the user? Proper encapsulation should have prevented this tell() issue.)

I then tried to use copy.copy, but that results in something like this: <closed file '<uninitialized file>', mode '<uninitialized file>' at 0x7f722ffda810>, which appears unusable.

Does there exist an alternative way to copy? Is there a way to initialize a file-object? Or should I give up on this use case entirely because it is not possible in Python?

ebolyen

1 Answer

You are looking for itertools.tee.

from itertools import tee
with open("somefile.txt", "r") as fh:
    fh1, fh2, fh3 = tee(fh, 3)

Once you call tee, do not use the parent iterator again. The iterators returned from tee may be used freely and independently, however.
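
To make the independence of the tee'd iterators concrete, here is a minimal sketch. It uses io.StringIO as a stand-in for a real file so it is self-contained; the variable names are only illustrative.

```python
from io import StringIO
from itertools import tee

# StringIO stands in for an open text file (an assumption for this sketch).
source = StringIO("alpha\nbeta\ngamma\n")

# Split the line iterator in two; `source` should not be used after this.
first, second = tee(source, 2)

# Advancing `first` does not affect what `second` will yield.
head = next(first)
rest_of_first = list(first)
all_of_second = list(second)
```

One thing to keep in mind for very large files: tee has to buffer every item that one iterator has consumed and another has not, so if one iterator runs far ahead of the others, memory use can grow accordingly.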

For file objects specifically (to keep file-specific methods like read), you can just open a file multiple times; each file object will maintain its own file pointer as it reads the file.

fh1, fh2, fh3 = [open("somefile.txt") for i in range(3)]

or, if you already have a file object fh:

fh1, fh2, fh3 = [open(fh.name) for i in range(3)]

This doesn't preserve an already advanced file pointer, but it's easy enough to jump ahead:

for x in fh1, fh2, fh3:
    x.seek(fh.tell())
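
Putting the reopen-and-seek steps together, a rough Python 3 sketch might look like the following. The temporary file and its contents are made up for the example. It advances the original handle with readline() rather than a for loop because, in Python 3 text mode, tell() raises OSError once iteration with next() has started.

```python
import os
import tempfile

# Create a small throwaway file so the sketch is self-contained.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("line1\nline2\nline3\n")
    path = tmp.name

try:
    with open(path) as fh:
        fh.readline()        # advance the original handle past the first line
        offset = fh.tell()   # still usable because we used readline(), not next()

        # Reopen the same file by name; each handle has its own file pointer.
        copies = [open(fh.name) for _ in range(2)]
        try:
            for c in copies:
                c.seek(offset)
            first_lines = [c.readline() for c in copies]

            # Reading from one copy does not move the other.
            copies[0].readline()
            next_from_second = copies[1].readline()
        finally:
            for c in copies:
                c.close()
finally:
    os.remove(path)
```

This only works when the file object has a usable name, so it would not apply to pipes, sockets, or in-memory objects such as StringIO.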
chepner
  • This is a great answer/solution, unfortunately we lose the `read(bytes)` interface of a file-object which may be needed in the future. – ebolyen Oct 14 '14 at 19:18
  • I had just reached that solution myself actually (`open(fh.name)`), however won't `fh.tell()` still be lying to me in the event of a real file? To add more detail, this is for a user-library and there is the potential that the user would need to seek beyond some point before the library should begin parsing. – ebolyen Oct 14 '14 at 19:51
  • I'm not sure what you mean by lying. Suppose you have a file object `fh` and you've already read 5 lines. `fh1` et al., upon being created, will all be pointing at byte zero. The `for` loop that calls `seek` is just to get the new file objects to the point where `fh` is now, assuming nothing reads from `fh` in the meantime. – chepner Oct 14 '14 at 19:59
  • `fh` will be pointing at the read-ahead buffer which was the problem in our initial solution. Fortunately we seem to have agreed on a way forward, which is to just ignore the 'problem' and we can mock the filehandle keeping track of the actual offset if needed in the future. – ebolyen Oct 14 '14 at 20:54
  • @ebolyen what happens if I try to write to one or two of them (i.e. `fh1.write('text1')`, `fh2.write('text2')`)? – pippo1980 Sep 04 '22 at 15:57