81

I can't see the tqdm progress bar when I use this code to iterate over my opened file:

with open(file_path, 'r') as f:
    for i, line in enumerate(tqdm(f)):
        if i >= start and i <= end:
            print("line #: %s" % i)
            for i in tqdm(range(0, line_size, batch_size)):
                # pause if a file named pause is found in the current dir
                re_batch = {}
                for j in range(batch_size):
                    re_batch[j] = re.search(line, last_span)

what's the right way to use tqdm here?

smci
Wei Wu

5 Answers

109

You're on the right track. You're using tqdm correctly, but avoid printing each line inside the loop: the prints interleave with the bar's output and break up its display. You'll also want to apply tqdm to your first for loop only, not the others, like so:

with open(file_path, 'r') as f:
    for i, line in enumerate(tqdm(f)):
        if i >= start and i <= end:
            for batch_start in range(0, line_size, batch_size):
                # pause if a file named pause is found in the current dir
                re_batch = {}
                for j in range(batch_size):
                    re_batch[j] = re.search(line, last_span)
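If you do need per-line output while the bar is running, tqdm's own tqdm.write() prints without disrupting the bar. A minimal sketch:

from tqdm import tqdm

for i in tqdm(range(10)):
    tqdm.write("processing item %d" % i)  # printed above the bar, so the bar stays intact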

Some notes on enumerate and its usage with tqdm here.
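The ordering matters when the iterable has a known length. A short sketch (an editor's illustration, not from the answer): wrap the sized iterable with tqdm and enumerate outside it, or pass total explicitly, because enumerate objects expose no __len__ for tqdm to read.

from tqdm import tqdm

items = list(range(1000))

for i, x in enumerate(tqdm(items)):        # total inferred from len(items)
    pass

for i, x in tqdm(enumerate(items)):        # total unknown: enumerate has no __len__
    pass

for i, x in tqdm(enumerate(items), total=len(items)):  # explicit workaround
    pass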

Valentino Constantinou
30

I ran into this as well - tqdm does not display a progress bar here because the number of lines in the file object has not been provided.

The for loop will iterate over lines, reading until the next newline character is encountered.

In order to get the progress bar from tqdm, you will first need to scan the file and count its number of lines, then pass that count to tqdm as the total:

from tqdm import tqdm

with open('myfile.txt', 'r') as f:
    num_lines = sum(1 for _ in f)

with open('myfile.txt', 'r') as f:
    for line in tqdm(f, total=num_lines):
        print(line)
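If the pre-scan itself is slow on very large files, a common speed-up (an editor's sketch, not part of the original answer) is to count newline bytes in large binary chunks instead of iterating decoded lines:

def count_lines(path, chunk_size=1 << 20):
    # Read the file in 1 MB binary chunks and count newline bytes,
    # skipping text decoding entirely.
    count = 0
    with open(path, 'rb') as f:
        while chunk := f.read(chunk_size):
            count += chunk.count(b'\n')
    return count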
user1446308
12

I'm trying to do the same thing on a file containing all Wikipedia articles, so I don't want to count the total lines before starting to process. Also, it's a bz2-compressed file, so the len of the decompressed line overestimates the number of bytes read in that iteration, so...

import bz2
from pathlib import Path
from tqdm import tqdm

with tqdm(total=Path(filepath).stat().st_size) as pbar:
    with bz2.open(filepath) as fin:
        for i, line in enumerate(fin):
            if not i % 1000:
                pbar.update(fin.tell() - pbar.n)
            # do something with the decompressed line
    # Debug-by-print to see the attributes of `pbar`:
    # print(vars(pbar))

Thank you Yohan Kuanke for your deleted answer. If moderators undelete it you can crib mine.
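A related note for file objects whose tell() is unavailable or disabled (the csv module's reader, for one, disables tell() on the text stream): you can track progress on the underlying binary handle instead, whose tell() keeps working. A minimal editor's sketch, with data.csv as a hypothetical path:

import csv
import io
import os
from tqdm import tqdm

filepath = 'data.csv'  # hypothetical path
with open(filepath, 'rb') as fb, tqdm(total=os.path.getsize(filepath), unit='B', unit_scale=True) as pbar:
    text = io.TextIOWrapper(fb, encoding='utf-8', newline='')
    for i, row in enumerate(csv.reader(text)):
        if not i % 1000:
            # the binary handle's tell() is cheap and not disabled by csv's next() calls;
            # it runs slightly ahead of the rows consumed because of read-ahead buffering
            pbar.update(fb.tell() - pbar.n)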

hobs
    This gives the right output but I found that calling `fin.tell()` / `pbar.update()` for every line of the file dramatically slowed down the iteration speed. Using an `if i % 100 == 0:` condition to update the pbar less frequently gave me a 10x speedup. – Ben Page Feb 18 '23 at 00:07
  • Excellent idea @BenPage! I'll add your optimization to the answer – hobs Feb 19 '23 at 19:42
  • You can't use this technique if you use the `csv` module to read your file (for example, with `csv_lines=csv.reader(fin)`). You get the error `OSError: telling position disabled by next() call` when you call `fin.tell()` – Eponymous Aug 31 '23 at 18:13
5

If you are reading from a very large file, try this approach:

from tqdm import tqdm
import os
import sys

file_size = os.path.getsize(filename)
lines_read = []
pbar = tqdm(total=file_size, unit="B", unit_scale=True)
with open(filename, 'r', encoding='UTF-8') as file:
    while (line := file.readline()):
        lines_read.append(line)
        pbar.update(sys.getsizeof(line) - sys.getsizeof('\n'))
pbar.close()

I left out the processing you might want to do before the append(line).

EDIT:

I changed len(line) to sys.getsizeof(line) - sys.getsizeof('\n'), as len(line) is not an accurate representation of how many bytes were actually read (see other posts about this). But even this is not 100% accurate, since sys.getsizeof(line) is not the real number of bytes read; it's a "close enough" hack if the file is very large.

I did try using f.tell() instead and subtracting a file pos delta in the while loop but f.tell with non-binary files is very slow in Python 3.8.10.

As per the link below, I also tried using f.tell() with Python 3.10 but that is still very slow.

If anyone has a better strategy, please feel free to edit this answer, but please provide some performance numbers before you do the edit. Remember that counting the number of lines before the loop is not acceptable for very large files and defeats the purpose of showing a progress bar altogether (try a 30 GB file with 300 million lines, for example).

Why f.tell() is slow in Python when reading a file in non-binary mode: https://bugs.python.org/issue11114
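One candidate strategy (an editor's sketch, unbenchmarked here, so treat the speed claim as an assumption): iterate the file in binary mode, where len() of each raw line is exactly the number of bytes consumed, then decode per line. This avoids both tell() and sys.getsizeof():

from tqdm import tqdm
import os

file_size = os.path.getsize(filename)
with open(filename, 'rb') as f, tqdm(total=file_size, unit='B', unit_scale=True) as pbar:
    for raw_line in f:
        line = raw_line.decode('UTF-8')  # decode per line; len(raw_line) is the exact byte count read
        pbar.update(len(raw_line))
        # per-line processing goes here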

ejkitchen
  • Thanks a lot, I was confused about how to use tqdm for an out-of-memory big file. – Iaoceot Oct 09 '22 at 03:56
  • If you import tqdm from tqdm, then remove one of the tqdm from the initial pbar statement-- i.e., pbar = tqdm(total=file_zize, unit="MB"). – Barrel Roll May 26 '23 at 16:47
2

In the case of reading a file with readlines(), the following can be used:

from tqdm import tqdm
with open(filename) as f:
    sentences = tqdm(f.readlines(), unit='MB')

The unit='MB' can be changed to 'B', 'KB', or 'GB' accordingly. Note that unit here is only a display label: tqdm is counting lines, not measuring bytes.
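Also note that readlines() loads the whole file into memory first, and the bar only advances as the wrapped list is consumed, so nothing displays until sentences is actually iterated. A minimal usage sketch:

from tqdm import tqdm

with open(filename) as f:
    for line in tqdm(f.readlines(), unit='lines'):
        pass  # per-line processing goes here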

Ashwin Geet D'Sa