339

I regularly perform pandas operations on data frames in excess of 15 million rows or so, and I'd love to have access to a progress indicator for particular operations.

Does a text based progress indicator for pandas split-apply-combine operations exist?

For example, in something like:

df_users.groupby(['userID', 'requestDate']).apply(feature_rollup)

where feature_rollup is a somewhat involved function that takes many DF columns and creates new user columns through various methods. These operations can take a while for large data frames, so I'd like to know if it is possible to have text-based output in an IPython notebook that updates me on the progress.

So far, I've tried canonical loop progress indicators for Python but they don't interact with pandas in any meaningful way.

I'm hoping there's something I've overlooked in the pandas library/documentation that allows one to know the progress of a split-apply-combine. A simple implementation would maybe look at the total number of data frame subsets upon which the apply function is working and report progress as the completed fraction of those subsets.
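For illustration, a minimal sketch of that idea (the wrapper name here is made up; it just counts groups as the applied function gets called):

import sys

def wrap_with_progress(func, n_groups):
    # hypothetical helper: report the completed fraction of groups
    state = {'done': 0}
    def wrapper(group):
        state['done'] += 1
        sys.stdout.write('\r{:.0%} of groups processed'.format(state['done'] / float(n_groups)))
        sys.stdout.flush()
        return func(group)
    return wrapper

g = df_users.groupby(['userID', 'requestDate'])
g.apply(wrap_with_progress(feature_rollup, g.ngroups))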

Is this perhaps something that needs to be added to the library?

cs95
  • 379,657
  • 97
  • 704
  • 746
cwharland
  • 6,275
  • 3
  • 22
  • 29
  • have you done a %prun (profile) on the code? Sometimes you can do operations on the whole frame before you apply to eliminate bottlenecks – Jeff Sep 04 '13 at 01:30
  • @Jeff: you bet, I did that earlier to squeeze every last bit of performance out of it. The issue really comes down to the pseudo map-reduce boundary I'm working at since the rows are in the tens of millions so I don't expect super speed increases just want some feedback on the progress. – cwharland Sep 04 '13 at 04:56
  • Consider cythonising: http://pandas.pydata.org/pandas-docs/dev/enhancingperf.html#cython-writing-c-extensions-for-pandas – Andy Hayden Sep 04 '13 at 09:38
  • @AndyHayden - As I commented on your answer your implementation is quite good and adds a small amount of time to the overall job. I also cythonised three operations inside feature rollup which regained all of the time that is now dedicated reporting progress. So in the end I bet I'll have progress bars with a reduction in total processing time if I follow through with cython on the whole function. – cwharland Sep 04 '13 at 17:19

10 Answers

699

Due to popular demand, I've added pandas support in tqdm (pip install "tqdm>=4.9.0"). Unlike the other answers, this will not noticeably slow pandas down -- here's an example for DataFrameGroupBy.progress_apply:

import pandas as pd
import numpy as np
from tqdm import tqdm
# from tqdm.auto import tqdm  # for notebooks

# Create new `pandas` methods which use `tqdm` progress
# (can use tqdm_gui, optional kwargs, etc.)
tqdm.pandas()

df = pd.DataFrame(np.random.randint(0, int(1e8), (10000, 1000)))
# Now you can use `progress_apply` instead of `apply`
df.groupby(0).progress_apply(lambda x: x**2)

In case you're interested in how this works (and how to modify it for your own callbacks), see the examples on GitHub, the full documentation on PyPI, or import the module and run help(tqdm). Other supported functions include map, applymap, aggregate, and transform.
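For instance, after tqdm.pandas() a plain Series gains progress_map and a DataFrame gains progress_applymap, so (reusing the df created above) something like this should also show bars:

df[0].progress_map(lambda x: x + 1)    # Series map with a progress bar
df.progress_applymap(lambda x: x + 1)  # element-wise applymap with a progress bar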

EDIT


To directly answer the original question, replace:

df_users.groupby(['userID', 'requestDate']).apply(feature_rollup)

with:

from tqdm import tqdm
tqdm.pandas()
df_users.groupby(['userID', 'requestDate']).progress_apply(feature_rollup)

Note (tqdm <= v4.8): for versions of tqdm below 4.8, instead of tqdm.pandas() you had to do:

from tqdm import tqdm, tqdm_pandas
tqdm_pandas(tqdm())
casper.dcl
  • 13,035
  • 4
  • 31
  • 32
  • This looks great! Why do you have to do the "Create and register a new `tqdm` instance with `pandas`"? Could tqdm infer what it's being called by (and override this only if needed/unusual case)? – Andy Hayden Dec 19 '15 at 16:50
  • 25
    `tqdm` was actually created for just plain iterables originally: `from tqdm import tqdm; for i in tqdm( range(int(1e8)) ): pass` The pandas support was a recent hack I made :) – casper.dcl Dec 20 '15 at 20:19
  • 14
    Btw, if you use Jupyter notebooks, you can also use tqdm_notebooks to get a prettier bar. Together with pandas you'd currently need to instantiate it like `from tqdm import tqdm_notebook; tqdm_notebook().pandas(*args, **kwargs)` [see here](http://stackoverflow.com/questions/40476680/how-to-use-tqdm-with-pandas-in-a-jupyter-notebook) – grinsbaeckchen Jan 12 '17 at 19:47
  • 3
    As of version 4.8.1 - use tqdm.pandas() instead. https://github.com/tqdm/tqdm/commit/ad9abcc63330ad5d22fd8fca83dcec3e9d37afe6 – mork Apr 23 '17 at 07:32
  • 1
    Thanks, @mork is correct. We're working (slowly) towards `tqdm` v5 which makes things more modularised. – casper.dcl Apr 23 '17 at 12:23
  • Can it be suppressed to just single line update/progress? – Sushant Kulkarni May 18 '17 at 21:31
  • `tqdm` will automatically be single-line if your environment supports it (eg windows 10 command prompt, unix terminal, etc). – casper.dcl May 19 '17 at 09:54
  • 1
    For recent syntax recommendation, see tqdm Pandas documentation here: https://pypi.python.org/pypi/tqdm#pandas-integration – Manu CJ Nov 27 '17 at 10:11
  • 1
    Is it possible to do this while also having parallelisation? – ifly6 Oct 19 '18 at 18:21
  • @abcdaa, ifly6 - I added an answer below with a parallelization option of tqdm. The output looks great - you could see the progress per process :) – mork Jan 20 '19 at 11:47
  • Note that `pandas==0.25.0` requires `tqdm>=4.33.0` – casper.dcl Aug 09 '19 at 20:01
  • `tqdm.auto` doesn't work in my case, `from tqdm import tqdm` does seem to. – zabop Aug 04 '20 at 11:46
  • @casper.dcl we can't use tqdm with `pandas.DataFrame.filter()` right? `progress_apply` only applies a function over a dataframe. – stucash Sep 01 '20 at 12:40
  • @stucash currently supported: `apply`, `map`, `applymap`, `aggregate`, and `transform`. You could open a feature request for other functions such as `filter` at https://github.com/tqdm/tqdm/issues – casper.dcl Sep 01 '20 at 12:55
  • @casper.dcl thanks! I just tried `progress_apply` with a lambda returning boolean value on a pandas dataframe, it surprisingly gave me identical results as if I was using `filter`. Hopefully this maybe something useful for you as well! – stucash Sep 01 '20 at 13:38
  • In `tqdm==4.48.2` and `pandas==1.1.2`, `tqdm.pandas()` raises a `FutureWarning`: The Panel class is removed from pandas. Accessing it from the top-level namespace will also be removed in the next version. Due to `from pandas import Panel`. – Janosh Mar 22 '21 at 12:36
  • 1
    @Casimir that warning disappears in `tqdm>=4.57.0` – casper.dcl Mar 22 '21 at 13:53
  • @casper.dcl Excellent. I've started using `tqdm.pandas()` a lot. Very useful. A question that popped up: Is it possible to pass arguments to the `tqdm` instance like `desc` or `disable`? – Janosh Mar 27 '21 at 08:02
  • 1
    this is so nice! What will be the life without answers like these! – sandeepsign Jun 04 '22 at 19:57
  • This is awesome! Does this also work to get a progress bar on reading files? e.g. `pd.read_csv` or `pd.read_parquet` ? – Geoffrey Negiar Jul 20 '22 at 13:55
  • Why does `progress_apply` sometimes run much faster than `apply`? I'm afraid some checks might be bypassed or something – catastrophic-failure Jan 25 '23 at 09:16
  • This changes my life – mchristos Mar 13 '23 at 11:49
31

In case you need support for how to use this in a Jupyter/IPython notebook, as I did, here's a helpful guide based on the relevant article:

from tqdm._tqdm_notebook import tqdm_notebook
import pandas as pd
import numpy as np

tqdm_notebook.pandas()
df = pd.DataFrame(np.random.randint(0, int(1e8), (10000, 1000)))
df.groupby(0).progress_apply(lambda x: x**2)

Note the underscore in the import statement for _tqdm_notebook. As the referenced article mentions, development is in a late beta stage.

UPDATE as of 11/12/2021

I'm now using pandas==1.3.4 and tqdm==4.62.3, and I'm not sure in which version the tqdm authors implemented this change, but the above import statement is deprecated. Instead use:

 from tqdm.notebook import tqdm_notebook

UPDATE as of 02/01/2022 It's now possible to simplify import statements for .py and .ipynb files alike:

from tqdm.auto import tqdm
tqdm.pandas()

That should work as expected for both types of development environments, and should work on pandas dataframes or other tqdm-worthy iterables.

UPDATE as of 05/27/2022 If you're using a jupyter notebook on SageMaker, this combo works:

from tqdm import tqdm
from tqdm.gui import tqdm as tqdm_gui
tqdm.pandas(ncols=50)
Victor Vulovic
  • 521
  • 4
  • 11
20

To tweak Jeff's answer (and have this as a reusable function):

import sys

def logged_apply(g, func, *args, **kwargs):
    step_percentage = 100. / len(g)
    sys.stdout.write('apply progress:   0%')
    sys.stdout.flush()

    def logging_decorator(func):
        def wrapper(*args, **kwargs):
            progress = wrapper.count * step_percentage
            # step the cursor back over the "NNN%" and overwrite it in place
            sys.stdout.write('\033[D \033[D' * 4 + format(progress, '3.0f') + '%')
            sys.stdout.flush()
            wrapper.count += 1
            return func(*args, **kwargs)
        wrapper.count = 0
        return wrapper

    logged_func = logging_decorator(func)
    res = g.apply(logged_func, *args, **kwargs)
    sys.stdout.write('\033[D \033[D' * 4 + format(100., '3.0f') + '%' + '\n')
    sys.stdout.flush()
    return res

Note: the apply progress percentage updates inline. If your function writes to stdout, this won't work.

In [11]: g = df_users.groupby(['userID', 'requestDate'])

In [12]: f = feature_rollup

In [13]: logged_apply(g, f)
apply progress: 100%
Out[13]: 
...

As usual you can add this to your groupby objects as a method:

from pandas.core.groupby import DataFrameGroupBy
DataFrameGroupBy.logged_apply = logged_apply

In [21]: g.logged_apply(f)
apply progress: 100%
Out[21]: 
...

As mentioned in the comments, this isn't a feature that core pandas would be interested in implementing. But Python allows you to create these for many pandas objects/methods (doing so would be quite a bit of work... although you should be able to generalise this approach).
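A rough, self-contained sketch of what that generalisation might look like (logged_call is a made-up name; it dispatches to any groupby method that maps func over each group):

import sys

def logged_call(g, method_name, func, *args, **kwargs):
    # works for 'apply', 'aggregate', 'transform', ... on a groupby object
    total = len(g)
    state = {'count': 0}

    def wrapper(*a, **kw):
        state['count'] += 1
        sys.stdout.write('\r{} progress: {:3.0f}%'.format(
            method_name, 100. * state['count'] / total))
        sys.stdout.flush()
        return func(*a, **kw)

    res = getattr(g, method_name)(wrapper, *args, **kwargs)
    sys.stdout.write('\n')
    return res

# e.g. logged_call(g, 'apply', feature_rollup)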

Andy Hayden
  • 359,921
  • 101
  • 625
  • 535
  • I say "quite a bit of work", but you could probably rewrite this entire function as a (more general) decorator. – Andy Hayden Sep 04 '13 at 10:42
  • Thanks for expanding on Jeff's post. I've implemented both and the slowdown for each is quite minimal (added a total of 1.1 mins to an operation that took 27 mins to complete). This way I can view the progress and given the adhoc nature of these operations I think this is an acceptable slow down. – cwharland Sep 04 '13 at 17:17
  • Excellent, glad it helped. I was actually surprised at the slow down (when I tried an example), I expected it to be a lot worse. – Andy Hayden Sep 04 '13 at 20:33
  • 1
    To further add to the efficiency of the posted methods, I was being lazy about data import (pandas is just too good at handling messy csv!!) and a few of my entries (~1%) had completely whacked out insertions (think whole records inserted into single fields). Eliminating these cause a massive speed up in the feature rollup since there was no ambiguity about what to do during split-apply-combine operations. – cwharland Sep 04 '13 at 22:41
  • ha! it does make it easy, sanity checks always recommended before crunching the numbers. what did 27 minutes become? – Andy Hayden Sep 04 '13 at 22:52
  • 1
    I'm down to 8 mins...but I added somethings to the feature rollup (more features -> better AUC!). This 8 mins is per chunk (two chunks total right now) with each chunk in the neighborhood of 12 million rows. So yeah...16 mins to do hefty operations on 24 million rows using HDFStore (and there's nltk stuff in feature rollup). Quite good. Let's hope the internet doesn't judge me on the initial ignorance or ambivalence towards the messed up insertions =) – cwharland Sep 05 '13 at 05:10
  • I can't seem to get this code to work, I'm trying to use it with `.dropna()` on my dataset but I keep getting `logged_apply(master_dataset, dropna()) NameError: name 'dropna' is not defined` any ideas? – BML91 Jun 30 '14 at 12:01
  • @BML91 DataFrame's apply is row-wise so I don't *think* dropna can be written in terms of apply (needed to use this answer). Also, dropna is written in cython (not pure python) so doing something like this will be much slower. – Andy Hayden Jun 30 '14 at 15:04
  • @BML91 if master was a DataFrame groupby, you could do logged_apply(master_dataset, pd.DataFrame.dropna), but again this will be slower than, and equivalent to, df.dropna(). – Andy Hayden Jun 30 '14 at 15:05
15

For anyone looking to apply tqdm to their own custom parallel pandas-apply code:

(I tried some of the libraries for parallelization over the years, but I never found a 100% parallelization solution, mainly for the apply function, and I always had to come back to my "manual" code.)

df_multi_core - this is the one you call. It accepts:

  1. Your df object
  2. The function name you'd like to call
  3. The subset of columns the function can be performed upon (helps reduce time/memory)
  4. The number of jobs to run in parallel (-1 or omit for all cores)
  5. Any other kwargs the df's function accepts (like "axis")

_df_split - this is an internal helper function that has to be positioned globally to the running module (Pool.map is "placement dependent"); otherwise I'd define it inside df_multi_core.

Here's the code from my gist (I'll add more pandas function tests there):

import pandas as pd
import numpy as np
import multiprocessing
from functools import partial

def _df_split(tup_arg, **kwargs):
    split_ind, df_split, df_f_name = tup_arg
    return (split_ind, getattr(df_split, df_f_name)(**kwargs))

def df_multi_core(df, df_f_name, subset=None, njobs=-1, **kwargs):
    if njobs == -1:
        njobs = multiprocessing.cpu_count()
    pool = multiprocessing.Pool(processes=njobs)

    # split on the requested column subset if given (reduces pickling cost),
    # otherwise split the whole frame
    if subset is not None:
        splits = np.array_split(df[subset], njobs)
    else:
        splits = np.array_split(df, njobs)

    pool_data = [(split_ind, df_split, df_f_name) for split_ind, df_split in enumerate(splits)]
    results = pool.map(partial(_df_split, **kwargs), pool_data)
    pool.close()
    pool.join()
    results = sorted(results, key=lambda x:x[0])
    results = pd.concat([split[1] for split in results])
    return results

Below is test code for a parallelized apply with tqdm's "progress_apply".

from time import time
from tqdm import tqdm
tqdm.pandas()

if __name__ == '__main__': 
    sep = '-' * 50

    # tqdm progress_apply test      
    def apply_f(row):
        return row['c1'] + 0.1
    N = 1000000
    np.random.seed(0)
    df = pd.DataFrame({'c1': np.arange(N), 'c2': np.arange(N)})

    print('testing pandas apply on {}\n{}'.format(df.shape, sep))
    t1 = time()
    res = df.progress_apply(apply_f, axis=1)
    t2 = time()
    print('result random sample\n{}'.format(res.sample(n=3, random_state=0)))
    print('time for native implementation {}\n{}'.format(round(t2 - t1, 2), sep))

    t3 = time()
    # res = df_multi_core(df=df, df_f_name='apply', subset=['c1'], njobs=-1, func=apply_f, axis=1)
    res = df_multi_core(df=df, df_f_name='progress_apply', subset=['c1'], njobs=-1, func=apply_f, axis=1)
    t4 = time()
    print('result random sample\n{}'.format(res.sample(n=3, random_state=0)))
    print('time for multi core implementation {}\n{}'.format(round(t4 - t3, 2), sep))

In the output you can see 1 progress bar for running without parallelization, and per-core progress bars when running with parallelization. There is a slight hiccup and sometimes the rest of the cores appear at once, but even then I think it's useful since you get the progress stats per core (it/sec and total records, for example).


Thank you @abcdaa for this great library!

Davide Fiocco
  • 5,350
  • 5
  • 35
  • 72
mork
  • 1,747
  • 21
  • 23
  • 2
    Thanks @mork - feel free to add to https://github.com/tqdm/tqdm/wiki/How-to-make-a-great-Progress-Bar or create a new page at https://github.com/tqdm/tqdm/wiki – casper.dcl Jan 21 '19 at 13:14
  • Thanks, but had to change these part: ```try: splits = np.array_split(df[subset], njobs) except ValueError: splits = np.array_split(df, njobs)``` because of KeyError exception instead of ValueError, change to Exception to handle all cases. – Marius Sep 19 '19 at 07:59
  • 1
    Thanks @mork - this answer should be higher. – Ian Apr 08 '20 at 14:32
10

Every answer here used pandas.DataFrame.groupby. If you want a progress bar on pandas.Series.apply without a groupby, here's how you can do it inside a Jupyter notebook:

from tqdm.notebook import tqdm
tqdm.pandas()


df['<applied-col-name>'] = df['<col-name>'].progress_apply(<your-manipulation-function>)
Naveen Reddy Marthala
  • 2,622
  • 4
  • 35
  • 67
  • I have to add this for anyone who wants to try this solution: You will need (`tqdm` version: `tqdm>=4.61.2`), otherwise it won't work. Also, be sure to restart your Jupyter notebook kernel after installing the new version of tqdm. (e.g., I used `pip install tqdm==4.62.3`) – Dr Neo Nov 21 '21 at 08:15
5

You can easily do this with a decorator:

from functools import wraps

def logging_decorator(func):

    @wraps(func)
    def wrapper(*args, **kwargs):
        wrapper.count += 1
        print("The function I modify has been called {0} time(s).".format(
              wrapper.count))
        return func(*args, **kwargs)
    wrapper.count = 0
    return wrapper

modified_function = logging_decorator(feature_rollup)

then just use the modified_function (and change when you want it to print)
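So, for the groupby in the question, the call would presumably look like:

df_users.groupby(['userID', 'requestDate']).apply(modified_function)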

Thomas K
  • 39,200
  • 7
  • 84
  • 86
Jeff
  • 125,376
  • 21
  • 220
  • 187
  • 1
    Obvious warning being this will slow down your function! You could even have it update with the progress http://stackoverflow.com/questions/5426546/in-python-how-to-change-text-after-its-printed e.g. count/len as percentage. – Andy Hayden Sep 04 '13 at 00:39
  • yep - you will have order(number of groups), so depending on what your bottleneck is this might make a difference – Jeff Sep 04 '13 at 01:27
  • perhaps the intuitive thing to do is wrap this in a `logged_apply(g, func)` function, where you'd have access to order, and could log from the beginning. – Andy Hayden Sep 04 '13 at 09:42
  • I did the above in my answer, also a cheeky percentage update. Actually I couldn't get yours working... I think it's the wraps bit. If you're using it for the apply it's not so important anyway. – Andy Hayden Sep 04 '13 at 10:44
1

I've changed Jeff's answer to include a total, so that you can track progress, and a variable to print only every X iterations (this actually improves the performance by a lot, if "print_at" is reasonably high).

import sys
from IPython.core.display import clear_output

def count_wrapper(func, total, print_at):

    def wrapper(*args):
        wrapper.count += 1
        if wrapper.count % wrapper.print_at == 0:
            clear_output()
            sys.stdout.write("%d / %d" % (wrapper.count, wrapper.total))
            sys.stdout.flush()
        return func(*args)
    wrapper.count = 0
    wrapper.total = total
    wrapper.print_at = print_at

    return wrapper

The clear_output() function comes from IPython (from IPython.core.display import clear_output, as imported above). If you're not on IPython, Andy Hayden's answer does this without it.
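A usage sketch for the groupby from the question (the print_at value of 100 is just an illustrative choice):

g = df_users.groupby(['userID', 'requestDate'])
g.apply(count_wrapper(feature_rollup, total=g.ngroups, print_at=100))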

1

For operations like merge, concat, and join, a progress bar can be shown by using Dask.

You can convert the Pandas DataFrames to Dask DataFrames, then show a Dask progress bar.

The code below shows a simple example:

Create and convert Pandas DataFrames

import pandas as pd
import numpy as np
import dask.dataframe as dd

n = 450000
maxa = 700

df1 = pd.DataFrame({'lkey': np.random.randint(0, maxa, n),'lvalue': np.random.randint(0,int(1e8),n)})
df2 = pd.DataFrame({'rkey': np.random.randint(0, maxa, n),'rvalue': np.random.randint(0, int(1e8),n)})

sd1 = dd.from_pandas(df1, npartitions=3)
sd2 = dd.from_pandas(df2, npartitions=3)

Merge with progress bar

from tqdm.dask import TqdmCallback
from dask.diagnostics import ProgressBar
ProgressBar().register()

with TqdmCallback(desc="compute"):
    sd1.merge(sd2, left_on='lkey', right_on='rkey').compute()

Dask is faster and requires fewer resources than Pandas for the same operation:

  • Pandas 74.7 ms
  • Dask 20.2 ms

For more details:

Note 1: I've tested this solution: https://stackoverflow.com/a/56257514/3921758 but it doesn't work for me, since it doesn't measure the merge operation.

Note 2: I've also checked the open feature requests for tqdm support of Pandas operations like these.

DataScientYst
  • 442
  • 2
  • 7
  • 19
0

For concat operations:

import pandas as pd
from tqdm import tqdm

# `files` is your list of inputs; `get_data` loads one of them into a DataFrame
df = pd.concat(
    [
        get_data(f)
        for f in tqdm(files, total=len(files))
    ]
)

tqdm just wraps an iterable and returns it, printing progress as it is consumed.

Wesley Cheek
  • 1,058
  • 12
  • 22
0

In case you want to iterate over the groups, this does the trick:

from tqdm import tqdm

groups = df.groupby(group_cols)
for keys, grouped_df in tqdm(groups, total=groups.ngroups):
    pass
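If you also want the combined result, the same loop can collect each group's output and concatenate at the end (a sketch, assuming some per-group function func that returns a DataFrame or Series):

import pandas as pd
from tqdm import tqdm

groups = df.groupby(group_cols)
results = []
for keys, grouped_df in tqdm(groups, total=groups.ngroups):
    results.append(func(grouped_df))
out = pd.concat(results)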