11

I'm iterating over a large CSV file and I'd like to print some progress indicator. As I understand it, counting the number of lines would require scanning the whole file for newline characters, so I cannot easily estimate progress from the line number.

Is there anything else I can do to estimate the progress while reading in lines? Maybe I can go by size?

Gere

5 Answers

22

You can use tqdm with large files in the following way:

import os
import tqdm

with tqdm.tqdm(total=os.path.getsize(filename)) as pbar:
    with open(filename, "rb") as f:
        for l in f:
            pbar.update(len(l))
            ...

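Because the file is opened in binary mode ("rb"), len(l) is the exact number of bytes in each line. If you open a UTF-8 file in text mode instead, len(l) counts characters rather than bytes, so the bar falls slightly behind on non-ASCII text, but it should be good enough for an estimate.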
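To make the difference between characters and bytes concrete, here is a minimal sketch. The sample line is made up for illustration; it only shows that a decoded string and its UTF-8 encoding can have different lengths.

```python
# A made-up sample line containing two non-ASCII characters (ï and é).
line = "na\u00efve,caf\u00e9\n"

# len() on a str counts characters (code points).
char_count = len(line)

# len() on the encoded bytes counts what the line occupies on disk as UTF-8.
byte_count = len(line.encode("utf-8"))

print(char_count, byte_count)  # 11 13
```

Since os.path.getsize reports bytes, the byte count is the one to feed into the progress bar.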

Piotr Czapla
8

You can use os.path.getsize(filename) to get the size of your target file. Then as you read data from the file, you can calculate progress percentage using a simple formula currentBytesRead/filesize*100%. This calculation can be done at the end of every N lines.

For the actual progress bar, take a look at Text Progress Bar in the Console.
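The approach described in this answer could be sketched roughly as follows, using only the standard library. The function name `read_with_progress` and the parameters `report_every` and `report` are hypothetical names chosen for this illustration, and the sketch assumes the file is UTF-8 encoded.

```python
import os


def read_with_progress(path, report_every=1000, report=print):
    """Read a file line by line, reporting percent done every N lines.

    Hypothetical helper for illustration; returns the number of bytes read.
    """
    total_bytes = os.path.getsize(path)
    bytes_read = 0
    # Binary mode, so len(raw_line) is a true byte count.
    with open(path, "rb") as f:
        for line_number, raw_line in enumerate(f, start=1):
            bytes_read += len(raw_line)
            # Assumption: the file is UTF-8; decode to get the text line.
            line = raw_line.decode("utf-8")
            # ... process `line` here ...
            if line_number % report_every == 0:
                report(f"{100.0 * bytes_read / total_bytes:.1f}%")
    return bytes_read
```

Reading in binary mode and decoding each line by hand keeps the byte count exact, at the cost of handling the decoding yourself.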

Community
  • 1
    How do I find `currentBytesRead` correctly representing actual bytes, while still reading correct (utf8) characters? – Gere Jul 22 '14 at 19:37
  • Only way would be to write a small amount of data to a tempfile in your chosen encoding, and then measure that tempfile size, calculate the character-to-byte ratio. I could be wrong, but this is the only way to ensure it works in a platform independent way, and at all times. This was also the reason, I did not mention it in the answer. It is a topic of its own. – Saimadhav Heblikar Jul 23 '14 at 06:30
  • Not sure, that writing gigabytes of data back would be faster than counting newlines. Maybe the file handle has some position indicator, though? – Gere Jul 23 '14 at 11:41
  • Not sure why you thought of writing "gigabytes of data". In my earlier comment I meant, write a small amount of data(say a single line) to a tempfile, with the required encoding. Then measure the size of the tempfile, to get character-to-bytes ratio. Then, while reading the large file, you can use filehandle.tell() to get a pointer to where you are currently in the file(in terms of number of characters). Then, multiply it with the ratio calculated earlier, to get the currentBytesRead value. – Saimadhav Heblikar Jul 23 '14 at 12:46
  • 3
    I thought `f.tell()` would be enough to get a byte position, but I noticed that if you iterate over a file, the `tell()` method is disabled (it reads chunks of 8k, but that's fine with me). I don't think character to bytes is constant enough to estimate for the rest of the file. Another difficulty is that I'm using `csv.reader` which complicates some of the suggestions here. I wish `tell` would work. – Gere Jul 23 '14 at 18:55
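For the `csv.reader` case raised in the comments, one possible workaround is to open the file in binary mode, wrap it in `io.TextIOWrapper` for decoding, and call `tell()` on the underlying binary handle rather than on the text wrapper. This is only a sketch: the function name `rows_with_progress` is hypothetical, and the reported position advances in buffer-sized chunks rather than line by line, which is usually fine for a progress estimate.

```python
import csv
import io
import os


def rows_with_progress(path, encoding="utf-8"):
    """Yield (row, fraction_done) pairs from a CSV file.

    Hypothetical helper for illustration. fraction_done is based on how many
    bytes the text layer has consumed from the binary file so far.
    """
    total_bytes = os.path.getsize(path)
    with open(path, "rb") as raw, \
            io.TextIOWrapper(raw, encoding=encoding, newline="") as text:
        for row in csv.reader(text):
            # tell() on the binary handle still works while iterating;
            # it moves forward one buffered chunk at a time.
            fraction = raw.tell() / total_bytes if total_bytes else 1.0
            yield row, fraction
```

A caller could loop with `for row, fraction in rows_with_progress(filename):` and pass `fraction` to whatever progress display is in use.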
6

Please check out this small (and useful) library named tqdm: https://github.com/noamraph/tqdm. You just wrap an iterator, and a cool progress meter is shown as the loop executes.

The image says it all:

[image: tqdm progress meter]

dmralev
6

This is based on @Piotr's answer, adapted for Python 3:

import os
from tqdm import tqdm

with tqdm(total=os.path.getsize(filepath)) as pbar:
    # Text mode yields str, so re-encode each line to count its bytes.
    with open(filepath, encoding='utf-8') as file:
        for line in file:
            pbar.update(len(line.encode('utf-8')))
            ...
YohanK
5

You can use os.path.getsize (or os.stat) to get the size of your text file. Then whenever you parse a new line, compute the size of that line in bytes and use it as an indicator.

import os
fileName = r"c:\somefile.log"
fileSize = os.path.getsize(fileName)

progress = 0
# Binary mode makes len(line) a byte count, matching fileSize.
with open(fileName, 'rb') as inputFile:
    for line in inputFile:
        progress = progress + len(line)
        progressPercent = 100.0 * progress / fileSize

# in the end, progress == fileSize
  • Will this work with the size estimate? Like Unicode etc? – Gere Jul 22 '14 at 15:26
  • It does work. The `len` actually counts the number of bytes in the unicode string (not the number of characters). What it actually does is call the `__len__` method of the class and return that value. – Adel Ahmadyan Jul 22 '14 at 16:03
  • 1
    Hmm, but that only works because I didn't specify the encoding? Reading utf8 files with this gives incorrect `line`. If I have a UTF8 file and I specify the encoding, I get character counts again. – Gere Jul 22 '14 at 19:35