Python file.tell gives wrong value location

Question

I am trying to extract a number of locations from an existing file using Python. This is my current code for extracting the locations:

    self.fh = open( fileName , "r+")
    p = re.compile('regGen regPorSnip begin')
    for line in self.fh :
        if ( p.search(line) ):
            self.porSnipStartFPtr = self.fh.tell()
            sys.stdout.write("found regPorSnip")

This snippet is repeated a number of times (less the file open) with different search values, and seems to work: I get the correct messages, and the variables have values.

However, using the code below, the first write location is wrong, while subsequent write locations are correct:

    self.fh.seek(self.rstSnipStartFPtr,0)
    self.fh.write(str);
    sys.stdout.write("writing %s" % str )
    self.rstSnipStartFPtr = self.fh.tell()

I have read that passing certain read/readline options to fh can cause an erroneous tell value because of Python's tendency to 'read ahead'. One suggestion I saw for avoiding this is to read the whole file and rewrite it, which isn't a very appealing solution in my application.

If I change the first code snippet to:

    for line in self.fh.read() :
        if ( p.search(line) ):
            self.porSnipStartFPtr = self.fh.tell()
            sys.stdout.write("found regPorSnip")

Then it appears that self.fh.read() is returning only characters and not an entire line. The search never matches. The same appears to hold true for self.fh.readline().

My conclusion is that fh.tell only returns valid file locations when queried after a write operation.

Is there a way to extract the accurate file location when reading/searching?

Thanks.

FYI: http://stackoverflow.com/a/15935038/8747 – Robᵩ Nov 01 '13 at 16:23

Tim Peters · Accepted Answer · 2013-11-01T17:07:11.170

The cause is (rather obscurely) explained in the docs for a file object's next() method:

When a file is used as an iterator, typically in a for loop (for example, for line in f: print line), the next() method is called repeatedly. This method returns the next input line, or raises StopIteration when EOF is hit. In order to make a for loop the most efficient way of looping over the lines of a file (a very common operation), the next() method uses a hidden read-ahead buffer. As a consequence of using a read-ahead buffer, combining next() with other file methods (like readline()) does not work right. However, using seek() to reposition the file to an absolute position will flush the read-ahead buffer.

The values returned by tell() reflect how far this hidden read-ahead buffer has gotten, which will typically be up to a few thousand bytes beyond the characters your program has actually retrieved.

There's no portable way around this. If you need to mix tell() with reading lines, then use the file's readline() method instead. The tradeoff is that, in return for getting usable tell() results, iterating over a large file with readline() is typically significantly slower than using for line in file_object:.
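
On Python 3 the same hazard shows up differently: a text-mode file refuses tell() outright while iteration is in progress, whereas readline() leaves it usable. A minimal sketch, assuming Python 3 and a throwaway temp file (the sample contents are made up for illustration):

```python
import os
import tempfile

# Hypothetical sample data; only the marker text comes from the question.
with tempfile.NamedTemporaryFile("wb", delete=False) as tmp:
    tmp.write(b"first line\nregGen regPorSnip begin\nlast line\n")
    path = tmp.name

# Iteration (next()) on a Python 3 text file disables tell() until the
# iterator is exhausted or the file is repositioned with seek().
with open(path, "r") as fh:
    next(fh)
    try:
        fh.tell()
        tell_failed = False
    except OSError:
        tell_failed = True

# readline() does not disable tell(), so the position stays meaningful.
with open(path, "r") as fh:
    fh.readline()
    pos_after_first = fh.tell()

os.remove(path)
```

Here `tell_failed` ends up true, while `pos_after_first` is the offset of the second line.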

Code

Concretely, change the loop to this:

line = self.fh.readline()
while line:
    if p.search(line):
        self.porSnipStartFPtr = self.fh.tell()
        sys.stdout.write("found regPorSnip")
    line = self.fh.readline()

I'm not sure that's what you really want, though: tell() is capturing the position of the start of the next line. If you want the position of the start of the matching line, then you need to change the logic, like so:

pos = self.fh.tell()
line = self.fh.readline()
while line:
    if p.search(line):
        self.porSnipStartFPtr = pos
        sys.stdout.write("found regPorSnip")
    pos = self.fh.tell()
    line = self.fh.readline()

or do it with a "loop and a half":

while True:
    pos = self.fh.tell()
    line = self.fh.readline()
    if not line:
        break
    if p.search(line):
        self.porSnipStartFPtr = pos
        sys.stdout.write("found regPorSnip")
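
The loop-and-a-half version can be wrapped in a small helper and checked by seeking back to the recorded offset. This is only a sketch: it assumes Python 3 and binary mode so that offsets are plain byte counts, and `find_line_starts` and the sample file contents are hypothetical, not part of the original code.

```python
import os
import re
import tempfile


def find_line_starts(fh, pattern):
    """Return the start offset of every line in fh that matches pattern."""
    starts = []
    while True:
        pos = fh.tell()        # taken before reading: start of this line
        line = fh.readline()
        if not line:           # empty bytes object means end of file
            break
        if pattern.search(line):
            starts.append(pos)
    return starts


# Hypothetical sample data; only the marker text comes from the question.
with tempfile.NamedTemporaryFile("wb", delete=False) as tmp:
    tmp.write(b"header\nregGen regPorSnip begin\ntrailer\n")
    path = tmp.name

with open(path, "r+b") as fh:
    starts = find_line_starts(fh, re.compile(rb"regGen regPorSnip begin"))
    fh.seek(starts[0])         # jump back to the recorded offset
    reread = fh.readline()     # should be the matching line again

os.remove(path)
```

Seeking to the recorded offset and reading one line returns the matching line, which is a quick way to confirm the offsets are usable for later writes.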

The file isn't huge, so I don't believe the penalty of using readline will be an issue. The first option is the most appropriate; the start of the next line is OK. It seems that testing for an empty file while reading the line can't be done when a file pointer is required. Thanks for the clarification. Greatly appreciated. — ktom, Nov 01 '13 at 18:48
Fantastic explanation thanks a ton! I also found this issue processing a large file but got around it by keeping an offset variable manually (offset += len(line)) instead of calling fh.tell(). This way you can keep the optimizations included with next() — dugloon, Apr 04 '17 at 16:44
@dugloon, that should work on Linuxy systems, but `tell()` results on text-mode files in Windows aren't generally simple byte offsets into the file. Python inherits this limitation from C. That's why the docs say "In text files (those opened without a b in the mode string), only seeks relative to the beginning of the file are allowed (the exception being seeking to the very file end with seek(0, 2)) and the only valid offset values are those returned from the f.tell(), or zero. Any other offset value produces undefined behaviour." — Tim Peters, Apr 04 '17 at 17:02
Thanks Tim! I forgot to add that piece - I am opening with mode="rb" — dugloon, Apr 05 '17 at 18:18
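
The manual-offset idea from these comments might look like the following. It is a sketch under stated assumptions: Python 3, a file opened in binary mode ("rb") so that len(line) is a true byte count, and made-up sample contents; `match_start` is a hypothetical name.

```python
import os
import re
import tempfile

# Hypothetical sample data; only the marker text comes from the question.
with tempfile.NamedTemporaryFile("wb", delete=False) as tmp:
    tmp.write(b"alpha\nregGen regPorSnip begin\nomega\n")
    path = tmp.name

pattern = re.compile(rb"regGen regPorSnip begin")
offset = 0            # byte offset of the start of the current line
match_start = None    # start offset of the first matching line, if any

with open(path, "rb") as fh:
    for line in fh:                       # fast iteration, no tell() calls
        if match_start is None and pattern.search(line):
            match_start = offset
        offset += len(line)               # advance past this line's bytes

file_size = os.path.getsize(path)
os.remove(path)
```

After the loop, `offset` equals the file size, which is a cheap sanity check that the running count has not drifted. In text mode the same arithmetic is not reliable, for the reason given in the comment above.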

score 0 · Answer 2 · answered Nov 01 '13 at 16:24

I guess I don't understand the issue:

>>> fh = open('test.txt')
>>> fh.tell()
0L
>>> fh.read(1)
'"'
>>> fh.tell()
1L
>>> fh.read(5)
'a" \n"'
>>> fh.tell()
7L

Joran Beasley

The problem is actually due to using `for line in file_object:` - there's another layer of buffering then. – Tim Peters Nov 01 '13 at 16:27
ahh got ya ... ok Ill delete this – Joran Beasley Nov 01 '13 at 16:33
why `for line in file_object` (the pythonic way) is a problem? – iacopo Mar 14 '14 at 11:40
