How can I find out the location of the file cursor when iterating over a file in Python 3?

In Python 2.7 it's trivial: use tell(). In Python 3 the same call raises an OSError:

Traceback (most recent call last):
  File "foo.py", line 113, in check_file
    pos = infile.tell()
OSError: telling position disabled by next() call

My use case is making a progress bar for reading large CSV files. Computing a total line count is too expensive and requires an extra pass. An approximate value is plenty useful, I don't care about buffers or other sources of noise, I want to know if it'll take 10 seconds or 10 minutes.

Simple code to reproduce the issue. It works as expected on Python 2.7, but throws on Python 3:

import csv
import os

file_size = os.stat(path).st_size
with open(path, "r") as infile:
    reader = csv.reader(infile)
    for row in reader:
        pos = infile.tell()  # OSError: telling position disabled by next() call
        print("At byte {} of {}".format(pos, file_size))

This answer https://stackoverflow.com/a/29641787/321772 suggests that the next() method disables tell() during iteration. The alternative is to read line by line manually, but that reading happens inside the csv module, so I can't get at it. I also can't fathom what Python 3 gains by disabling tell().

So what is the preferred way to find out your byte offset while iterating over the lines of a file in Python 3?

Adam
  • You could use `enumerate` and return the line number. That way you can give something useful to the user without having to traverse the file twice – Maarten Fabré Sep 25 '17 at 15:10
  • @MaartenFabré of course it's useful to print the line number, if only to show the script isn't stuck, and also it's all you can do if you don't know the length (i.e. reading from stdin). But it's far far far better to print "55% done, 2 minutes remaining" than "read 10,543,000 rows". – Adam Sep 25 '17 at 21:13

3 Answers

The csv module just expects the first parameter of the reader call to be an iterator that returns one line on each next call, so you can use an iterator wrapper that counts the characters. If you want the count to be accurate, you will have to open the file in binary mode. In fact this is fine, because you then get no end-of-line conversion, which is what the csv module expects.

So a possible wrapper is:

class SizedReader:
    def __init__(self, fd, encoding='utf-8'):
        self.fd = fd
        self.size = 0
        self.encoding = encoding   # specify encoding in constructor, with utf8 as default
    def __next__(self):
        line = next(self.fd)
        self.size += len(line)
        return line.decode(self.encoding)   # returns a decoded line (a true Python 3 string)
    def __iter__(self):
        return self

Your code would then become:

import csv
import os

file_size = os.stat(path).st_size
with open(path, "rb") as infile:
    szrdr = SizedReader(infile)
    reader = csv.reader(szrdr)
    for row in reader:
        pos = szrdr.size  # gives position at end of current line
        print("At byte {} of {}".format(pos, file_size))

The good news here is that you keep all the power of the csv module, including newlines in quoted fields...
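
As a self-contained check of the wrapper idea, here is a minimal sketch that runs it over in-memory data instead of a real file. The sample bytes are made up for illustration (one non-ASCII field and one quoted field spanning two lines), and the class is restated only so the block runs on its own.

```python
import csv
import io


class SizedReader:
    """Count raw bytes read while handing decoded lines to csv.reader."""

    def __init__(self, fd, encoding="utf-8"):
        self.fd = fd
        self.size = 0
        self.encoding = encoding

    def __iter__(self):
        return self

    def __next__(self):
        line = next(self.fd)        # bytes, line ending left untouched
        self.size += len(line)      # len() of bytes is a byte count
        return line.decode(self.encoding)


# Hypothetical sample data, not taken from the question.
data = 'id,text\r\n1,"caf\u00e9"\r\n2,"two\r\nlines"\r\n'.encode("utf-8")
total = len(data)

szrdr = SizedReader(io.BytesIO(data))
rows = []
fractions = []
for row in csv.reader(szrdr):
    rows.append(row)
    fractions.append(szrdr.size / total)
    print("{:.0%} done".format(fractions[-1]))
```

The quoted field that spans two physical lines still comes back as a single row, and the byte counter reaches the full data size after the last row.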

Serge Ballesta
  • This works. Though you don't need to worry about the encoding; just take what you're given, find its length, and return it. That way you don't change the decoding behavior. Also note that you need a `def next(self): return self.__next__()` so the same code works on both Python 2 and 3. – Adam Sep 25 '17 at 20:54
  • @Adam: the question was specifically about Python 3. If you do not decode what is read in binary mode, you will get bytes and not strings. The csv module behaves quite differently in Python2 and Python3, that's the reason why I have not tried to give compatible code. It is indeed possible but will be more complex. – Serge Ballesta Sep 25 '17 at 20:59
  • True, but the question doesn't open the file in binary mode. – Adam Sep 25 '17 at 21:08
  • @Adam:... my answer explains why the file should be opened in binary mode. If you don't and if the file is not in plain ASCII, the size will not be accurate. – Serge Ballesta Sep 25 '17 at 22:17
  • Well, it's good - but it seems to slow down reading the file quite a lot compared to using tell(). – xorsyst Jan 29 '19 at 17:23
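
If an approximate position is enough, as the question says, one possible alternative is to ask the underlying binary buffer for its position rather than the text file object. The sketch below assumes a regular file opened in text mode, whose `buffer` attribute is the buffered binary reader; the temporary file and its contents are made up for illustration. The reported position runs ahead of the row being processed, because the text layer reads in chunks.

```python
import csv
import os
import tempfile

# Build a throwaway CSV file (hypothetical contents, for illustration only).
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False, newline="") as tmp:
    csv.writer(tmp).writerows([[i, "x" * 20] for i in range(1000)])
    path = tmp.name

file_size = os.path.getsize(path)
positions = []

with open(path, "r", newline="") as infile:
    for row in csv.reader(infile):
        # tell() on the text wrapper is disabled during iteration, but the
        # binary buffer underneath still reports how far it has been read.
        positions.append(infile.buffer.tell())

os.remove(path)
print("Last reported position: {} of {}".format(positions[-1], file_size))
```

This keeps the per-row work to a single method call, at the cost of a position that advances in chunk-sized jumps.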

If you are comfortable doing without the csv module, you can do something like:

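import os

file_size = os.path.getsize('SampleCSV.csv')
pos = 0

# newline="" disables newline translation, so "\r\n" still counts as two characters
with open('SampleCSV.csv', "r", newline="") as infile:
    for line in infile:
        pos += len(line)    # line already includes its line ending; exact for ASCII data
        row = line.rstrip("\r\n").split(',')
        print("At byte {} of {}".format(pos, file_size))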

But this might not work when a quoted field itself contains a comma.

Ex: 1,"Hey, you..",22:04

Though such cases can also be handled using regular expressions.
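
To make the limitation concrete, here is a small sketch comparing a plain split with the csv module on the example line above. Passing a one-element list to `csv.reader` is just a convenient way to parse a single string.

```python
import csv

line = '1,"Hey, you..",22:04'

naive = line.split(',')
# naive  -> ['1', '"Hey', ' you.."', '22:04']   (quoted field cut in two)

parsed = next(csv.reader([line]))
# parsed -> ['1', 'Hey, you..', '22:04']        (quoted field kept whole)

print(naive)
print(parsed)
```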

Siddhesh

Since your CSV file is large, there is also another solution, taken from the page you mentioned:

Use offset += len(line) instead of file.tell(). For example,

offset = 0
with open(path, mode) as file:
    for line in file:
        offset += len(line)
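
Whether this offset is a byte count depends on the mode the file is opened in. The sketch below uses made-up in-memory data with non-ASCII characters and Windows line endings to show the difference: in binary mode len(line) counts bytes, while in text mode with default settings it counts characters after newline translation.

```python
import io

# Hypothetical sample data: two accented characters and "\r\n" line endings.
data = "na\u00efve,1\r\ncaf\u00e9,2\r\n".encode("utf-8")

# Binary mode: each line is bytes, so the running total is a byte offset.
byte_offset = 0
for line in io.BytesIO(data):
    byte_offset += len(line)

# Text mode (default newline handling): "\r\n" becomes "\n" and each
# accented letter counts once, so the total falls short of the file size.
char_offset = 0
for line in io.TextIOWrapper(io.BytesIO(data), encoding="utf-8"):
    char_offset += len(line)

print(byte_offset, char_offset, len(data))  # prints "19 15 19"
```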

Zhou Hongbo
  • The question suggests this alternative and explains why it doesn't work with the CSV module. The accepted answer is a way to make it work with CSV. – Adam Jan 14 '21 at 17:29