Iterate over large file with progress indicator in Python?

Question

I'm iterating over a large CSV file and I'd like to print some kind of progress indicator. As I understand it, counting the number of lines would require parsing the whole file for newline characters, so I cannot easily estimate progress from the line number.

Is there anything else I can do to estimate the progress while reading in lines? Maybe I can go by size?

Piotr Czapla · Answer 1 · 2021-10-01T08:04:26.680

22

You can use tqdm with large files in the following way:

import os
import tqdm

with tqdm.tqdm(total=os.path.getsize(filename)) as pbar:
    with open(filename, "rb") as f:
        for l in f:
            pbar.update(len(l))
            ...

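Because the file is opened in binary mode ("rb"), len(l) is the exact number of bytes in each line. If you open a UTF-8 file in text mode instead, len(l) counts characters rather than bytes, so the bar will fall short on non-ASCII text, but it should be good enough.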

edited Oct 01 '21 at 08:04

answered Aug 05 '18 at 11:13

Piotr Czapla

6

it has changed to: `with tqdm.tqdm(total=os.path.getsize(file)) as pbar:` – Malick Dec 20 '18 at 21:26
Maybe better would be to use: pbar.update(f.tell() - pbar.n) instead of: pbar.update(len(l)) – Martin Dec 06 '19 at 07:09
@Martin unfortunately, the f.tell() call fails: `OSError: telling position disabled by next() call` – fgiraldeau Nov 02 '22 at 18:49
@fgiraldeau try changing the `for l in f` to `while l := f.readline():`, it should help – Martin Nov 03 '22 at 08:27
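Following up on the comments above, here is a minimal sketch of the `readline()` plus `tell()` idea, using only the standard library. The helper name `lines_with_offset` is made up for illustration. It assumes the file is opened in binary mode; the `OSError` quoted above is what a text-mode file raises when `tell()` is called during `for line in f` iteration.

```python
def lines_with_offset(path):
    """Yield (line, offset) pairs; offset is the byte position after the line."""
    with open(path, "rb") as f:
        # readline() keeps tell() usable, and in binary mode tell() is a
        # plain byte offset that can be compared with os.path.getsize(path).
        while line := f.readline():
            yield line, f.tell()
```

With tqdm, the offset could drive the bar via `pbar.update(offset - pbar.n)`, as suggested in the comment above. The `:=` operator requires Python 3.8 or later.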

score 8 · Answer 2 · edited May 23 '17 at 12:18

You can use os.path.getsize(filename) to get the size of your target file. Then as you read data from the file, you can calculate progress percentage using a simple formula currentBytesRead/filesize*100%. This calculation can be done at the end of every N lines.

For the actual progress bar, you can take a look at the question "Text Progress Bar in the Console".
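The formula above could be sketched roughly as follows. The names `percent_done` and `read_with_percent` and the default of 1000 lines are illustrative choices, not part of the answer. The file is read in binary mode here so that the running total is in bytes, like the file size.

```python
import os


def percent_done(bytes_read, file_size):
    """Return progress as a percentage; an empty file counts as finished."""
    return 100.0 if file_size == 0 else 100.0 * bytes_read / file_size


def read_with_percent(path, every=1000):
    """Yield each raw line, printing the percentage once every `every` lines."""
    file_size = os.path.getsize(path)
    bytes_read = 0
    with open(path, "rb") as f:  # binary mode: len(line) is a byte count
        for count, line in enumerate(f, start=1):
            bytes_read += len(line)
            if count % every == 0:
                print(f"\r{percent_done(bytes_read, file_size):5.1f}%", end="", flush=True)
            yield line
    print("\r100.0%")
```

Each yielded line is a `bytes` object, so call `line.decode("utf-8")` (or whatever encoding the file uses) before processing it as text.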

edited May 23 '17 at 12:18

Community

answered Jul 22 '14 at 14:48

Saimadhav Heblikar

1

How do I find `currentBytesRead` correctly representing actual bytes, while still reading correct (utf8) characters? – Gere Jul 22 '14 at 19:37
The only way would be to write a small amount of data to a tempfile in your chosen encoding, measure the size of that tempfile, and calculate the character-to-byte ratio. I could be wrong, but this is the only way to ensure it works in a platform-independent way at all times. This is also the reason I did not mention it in the answer; it is a topic of its own. – Saimadhav Heblikar Jul 23 '14 at 06:30
Not sure that writing gigabytes of data back would be faster than counting newlines. Maybe the file handle has some position indicator, though? – Gere Jul 23 '14 at 11:41
Not sure why you thought of writing "gigabytes of data". In my earlier comment I meant: write a small amount of data (say, a single line) to a tempfile with the required encoding. Then measure the size of the tempfile to get the character-to-byte ratio. Then, while reading the large file, you can use filehandle.tell() to get a pointer to where you currently are in the file (in terms of number of characters). Then multiply it by the ratio calculated earlier to get the currentBytesRead value. – Saimadhav Heblikar Jul 23 '14 at 12:46
I thought `f.tell()` would be enough to get a byte position, but I noticed that if you iterate over a file, the `tell()` method is disabled (it reads chunks of 8k, but that's fine with me). I don't think character to bytes is constant enough to estimate for the rest of the file. Another difficulty is that I'm using `csv.reader` which complicates some of the suggestions here. I wish `tell` would work. – Gere Jul 23 '14 at 18:55
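One possible workaround for the `csv.reader` case, sketched under the assumption that chunk-level accuracy is acceptable: open the file in binary mode, wrap it in `io.TextIOWrapper` for decoding, and call `tell()` on the underlying binary handle rather than on the text layer. The function name `csv_rows_with_progress` and the UTF-8 default are illustrative.

```python
import csv
import io
import os


def csv_rows_with_progress(path, encoding="utf-8"):
    """Yield (row, fraction); fraction is the share of the file's bytes consumed."""
    total = os.path.getsize(path)
    # newline="" is what the csv module documentation recommends for input files.
    with open(path, "rb") as raw, io.TextIOWrapper(raw, encoding=encoding, newline="") as text:
        for row in csv.reader(text):
            # raw.tell() is the byte position of the binary layer. The text
            # layer reads ahead in chunks, so this runs ahead of the current row.
            yield row, (raw.tell() / total if total else 1.0)
```

Because of the read-ahead, a small file will report 1.0 from the first row onwards; on a large file the fraction advances in chunk-sized steps, which is usually fine for a progress bar.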

dmralev · Answer 3 · 2014-07-22T15:29:44.767

6

Please check out this small (and useful) library named tqdm: https://github.com/noamraph/tqdm. You just wrap an iterator, and a cool progress meter shows as the loop executes.

The image says it all.

[image: tqdm progress meter]

edited Jul 22 '14 at 15:29

answered Jul 22 '14 at 15:05

dmralev

7

It's indeed pretty cool and I will get that. It doesn't quite answer the question, but I like it. – Gere Jul 22 '14 at 19:28
Is there a way to get the line count with tqdm? – J'e Aug 11 '17 at 18:30

YohanK · Answer 4 · 2019-11-20T13:33:07.320

6

This is based on @Piotr's answer, adapted for Python 3:

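import os
from tqdm import tqdm  # import the class; the bare module is not callable

with tqdm(total=os.path.getsize(filepath), unit="B", unit_scale=True) as pbar:
    # newline="" keeps "\r\n" line endings intact so the byte count stays accurate
    with open(filepath, encoding="utf-8", newline="") as file:
        for line in file:
            # re-encode the line to count bytes rather than characters
            pbar.update(len(line.encode("utf-8")))
            ...
    # no file.close() needed: the with block closes the file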

edited Nov 20 '19 at 13:33

answered Nov 20 '19 at 13:27

YohanK

Adel Ahmadyan · Answer 5 · 2014-07-22T15:05:53.390

5

You can use os.path.getsize (or os.stat) to get the size of your text file. Then whenever you parse a new line, compute the size of that line in bytes and use it as an indicator.

import os

fileName = r"c:\somefile.log"  # raw string, so a single backslash is enough
fileSize = os.path.getsize(fileName)

progress = 0
# Binary mode, so that len(line) counts bytes and matches fileSize
with open(fileName, 'rb') as inputFile:
    for line in inputFile:
        progress += len(line)
        progressPercent = 100.0 * progress / fileSize

# in the end, progress == fileSize

edited Jul 22 '14 at 15:05

answered Jul 22 '14 at 14:48

Adel Ahmadyan

Will this work with the size estimate? Like Unicode etc? – Gere Jul 22 '14 at 15:26
It does work. The `len` actually counts the number of bytes in the unicode string (not the number of characters). What it actually does is call the `__len__` method of the class and return that value. – Adel Ahmadyan Jul 22 '14 at 16:03
Hmm, but that only works because I didn't specify the encoding? Reading utf8 files with this gives incorrect `line`. If I have a UTF8 file and I specify the encoding, I get character counts again. – Gere Jul 22 '14 at 19:35
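The difference discussed in these comments can be seen in a small example. This assumes Python 3, where `len()` of a `str` counts characters; the sample line is made up.

```python
line = "Zoë,Łukasz\n"  # two characters here need two bytes each in UTF-8

char_count = len(line)                  # what a text-mode read gives you
byte_count = len(line.encode("utf-8"))  # what os.path.getsize() measures

print(char_count, byte_count)  # 11 13
```

So a progress total based on the file size in bytes only lines up with per-line `len()` if the lines are read as bytes or re-encoded before measuring.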
