30

from https://pypi.org/project/tqdm/:

import pandas as pd
import numpy as np
from tqdm import tqdm

df = pd.DataFrame(np.random.randint(0, 100, (100000, 6)))
tqdm.pandas(desc="my bar!")p`
df.progress_apply(lambda x: x**2)

I took this code and edited it so that I create a DataFrame from load_excel rather than using random numbers:

import pandas as pd
from tqdm import tqdm
import numpy as np

filename="huge_file.xlsx"
df = pd.DataFrame(pd.read_excel(filename))
tqdm.pandas()
df.progress_apply(lambda x: x**2)

This gave me an error, so I changed df.progress_apply to this:

df.progress_apply(lambda x: x)

Here is the final code:

import pandas as pd
from tqdm import tqdm
import numpy as np

filename="huge_file.xlsx"
df = pd.DataFrame(pd.read_excel(filename))
tqdm.pandas()
df.progress_apply(lambda x: x)

This results in a progress bar, but it doesn't actually show any progress, rather it loads the bar, and when the operation is done it jumps to 100%, defeating the purpose.

My question is this: How do I make this progress bar work?
What does the function inside of progress_apply actually do?
Is there a better approach? Maybe an alternative to tqdm?

Any help is greatly appreciated.

user2303336
  • 333
  • 1
  • 3
  • 7
  • 1
    tqdm is tracking the time it takes to move through an iterable. To have a status bar, you would need to understand how the backed of pandas loads files into a dataframe and write your own code to do so (if possible). If time to load is the concern why not just get a rough estimate on a few row lengths – DJK Sep 06 '18 at 18:07
  • would a spinning wheel be enough ? – rocksportrocker Sep 06 '18 at 18:35

6 Answers6

7

Will not work. pd.read_excel blocks until the file is read, and there is no way to get information from this function about its progress during execution.

It would work for read operations which you can do chunk wise, like

chunks = []
for chunk in pd.read_csv(..., chunksize=1000):
    update_progressbar()
    chunks.append(chunk)

But as far as I understand tqdm also needs the number of chunks in advance, so for a propper progress report you would need to read the full file first....

rocksportrocker
  • 7,251
  • 2
  • 31
  • 48
6

The following is a one-liner solution utilizing tqdm:

import pandas as pd
from tqdm import tqdm

df = pd.concat([chunk for chunk in tqdm(pd.read_csv(file_name, chunksize=1000), desc='Loading data')])

If you know the total lines to be loaded, you can add that information with the parameter total to the tqdm fuction, resulting in a percentage output.

3

This might help for people with similar problem. here you can get help

for example:

for i in tqdm(range(0,3), ncols = 100, desc ="Loading data.."): 
    df=pd.read_excel("some_file.xlsx",header=None)
    LC_data=pd.read_excel("some_file.xlsx",'Sheet1', header=None)
    FC_data=pd.read_excel("some_file.xlsx",'Shee2', header=None)    
print("------Loading is completed ------")
2

DISCLAIMER: This works only with xlrd engine and is not thoroughly tested!

How it works? We monkey-patch xlrd.xlsx.X12Sheet.own_process_stream method that is responsible to load sheets from file-like stream. We supply own stream, that contains our progress bar. Each sheet has it's own progress bar.

When we want the progress bar, we use load_with_progressbar() context manager and then do pd.read_excel('<FILE.xlsx>').

import xlrd
from tqdm import tqdm
from io import RawIOBase
from contextlib import contextmanager


class progress_reader(RawIOBase):
    def __init__(self, zf, bar):
        self.bar = bar
        self.zf = zf

    def readinto(self, b):
        n = self.zf.readinto(b)
        self.bar.update(n=n)
        return n


@contextmanager
def load_with_progressbar():

    def my_get_sheet(self, zf, *other, **kwargs):
        with tqdm(total=zf._orig_file_size) as bar:
            sheet = _tmp(self, progress_reader(zf, bar), **kwargs)
        return sheet

    _tmp = xlrd.xlsx.X12Sheet.own_process_stream

    try:
        xlrd.xlsx.X12Sheet.own_process_stream = my_get_sheet
        yield
    finally:
        xlrd.xlsx.X12Sheet.own_process_stream = _tmp


import pandas as pd

with load_with_progressbar():
    df = pd.read_excel('sample2.xlsx')

print(df)

Screenshot of progress bar:

enter image description here

Andrej Kesely
  • 168,389
  • 15
  • 48
  • 91
  • This gives me the following error: 'ZipExtFile' object has no attribute '_orig_file_size' – Imad Aug 19 '20 at 15:05
1

The following is based on user's rocksportrocker excellent answer.

  • I am a Python beginner!
  • Below, please find my first version of using user rocksportrocker's recommendation.

import pandas as pd

print("Info: Loading starting.")

# https://stackoverflow.com/questions/52209290
temp = [];
myCounter = 1;
myChunksize = 10000;
# https://stackoverflow.com/questions/24251219/
for myChunk in pd.read_csv('YourFileName.csv', chunksize = myChunksize, low_memory = False):
    print('# of rows processed: ', myCounter*myChunksize)
    myCounter = myCounter + 1;
    temp.append(myChunk)
    
print("Info: Loading complete.")

# https://stackoverflow.com/questions/33642951
df = pd.concat(temp, ignore_index = True)
df.head()

enter image description here

Dr. Manuel Kuehner
  • 389
  • 1
  • 6
  • 16
0

This is based on suggestions from many of the previous responses. I use this to load a large TSV that contains comment lines (#). It updates the tqdm progress bar with a percentage value.

from tqdm import tqdm
import pandas as pd

        # Get number of lines in file.
        with open(file, 'r') as fp:
            lines = len(fp.readlines())
        # Read file in chunks, updating progress bar after each chunk.
        listdf = []
        with tqdm(total=lines) as bar:
            for chunk in pd.read_csv(file,chunksize=1000,comment='#',sep='\t'):
                listdf.append(chunk)
                bar.update(chunk.shape[0])

        df = pd.concat(listdf,ignore_index=True)
Barrel Roll
  • 811
  • 6
  • 4