
I need a tqdm progress bar over a (possibly long) sequence of merge operations. In my application, I have a cascade of operations like the following:

data = data.merge(get_data_source1(), on="id", how="left")\
           .merge(get_data_source2(), on="id", how="left")\
           ...
           .merge(get_data_sourceN(), on="id", how="left")

It is not relevant what the get_data_source<i> functions do: they pull data from somewhere (for instance, from different files or different DBs), each returns a DataFrame with an "id" column, and each takes a few seconds.

I need a progress bar that advances with N. This is probably feasible by encapsulating each merge operation in a lambda function and putting them into an iterable, but that looks like an overengineered and hard-to-read solution (please correct me if you think I'm wrong). Also, I'm aware that it is possible to add a progress bar to each merge operation using the progress_apply function (as reported here), but that would generate several (N) short progress bars rather than a single one.

For the sake of emulating a working setup, let's consider this toy example

import pandas as pd
import numpy as np
import time

data = pd.DataFrame(np.random.randint(0,100,size=(100,3)), columns=["id","A", "B"])

def get_data(col):
    time.sleep(1.0)
    return pd.DataFrame(np.random.randint(0,100,size=(100,2)), columns=["id",col])

data.merge(get_data("C"), on="id", how="left")\
    .merge(get_data("D"), on="id", how="left")\
    .merge(get_data("E"), on="id", how="left")\
    .merge(get_data("F"), on="id", how="left")\
    .merge(get_data("G"), on="id", how="left")\
    .merge(get_data("H"), on="id", how="left")

What would be the best way to approach the problem?

Matt07

2 Answers


I would suggest using functools.reduce.

Here's a snippet on some sample data frames, but it would work with any data frame iterable, just wrap it with tqdm.

import functools
import numpy as np
import pandas as pd
from tqdm.auto import tqdm

N = 10
columns = [["A", "B"], ["C"], ["D", "E", "F"]]
dfs = [
    pd.DataFrame(
        {
            "key": range(N),
            **{c: np.random.rand(N) for c in cols}
        }
    )
    for cols in columns
]
functools.reduce(lambda x, y: x.merge(y), tqdm(dfs[1:]), dfs[0])
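Applied to the question's setup, where each source comes from a slow call, the same pattern can consume a generator so that each fetch runs only as the bar advances. This is a minimal sketch, assuming the toy get_data from the question (with the sleep shortened) and assuming it is acceptable to pass total= explicitly, since a generator has no length for tqdm to read:

```python
import functools
import time

import numpy as np
import pandas as pd
from tqdm.auto import tqdm

data = pd.DataFrame(np.random.randint(0, 100, size=(100, 3)), columns=["id", "A", "B"])


def get_data(col):
    # Stand-in for a slow fetch, as in the question (sleep shortened here).
    time.sleep(0.1)
    return pd.DataFrame(np.random.randint(0, 100, size=(100, 2)), columns=["id", col])


cols = ["C", "D", "E", "F", "G", "H"]

# Generator: get_data is called only when reduce asks for the next item,
# so the bar advances once per fetch-and-merge step.
sources = (get_data(c) for c in cols)

merged = functools.reduce(
    lambda left, right: left.merge(right, on="id", how="left"),
    tqdm(sources, total=len(cols)),  # total is needed because generators have no len()
    data,
)
```

With a plain list instead of a generator, all fetches would finish before the bar starts, and the bar would only time the merges.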

You can create a list of the values you want to pass to get_data and iterate over this list with tqdm.

import pandas as pd
import numpy as np
import time
import tqdm


data = pd.DataFrame(np.random.randint(0,100,size=(100,3)), columns=["id","A", "B"])

def get_data(col):
    time.sleep(1.0)
    return pd.DataFrame(np.random.randint(0,100,size=(100,2)), columns=["id",col])

values = ["C","D","E","F","G","H"]

for i in tqdm.tqdm(values):
    data = data.merge(get_data(i), on="id", how="left")
data

Assign the merged dataframe back to data at each step, as in the above example. Note that DataFrame.merge has no inplace parameter, so it always returns a new dataframe and the reassignment is required.

EDIT: As all the get_data functions are different, I suggest, as the question did, creating an iterable of the functions. Lambdas are not required, as the example below shows (param1, param2 and param3 are placeholders):

functions = [get_data1, get_data2, get_data3]
for func in tqdm.tqdm(functions):
    # Fetch from this source, then merge the result into data
    data = data.merge(func(param1, param2, param3), on="id", how="left")

This iterates over all the functions in the list, executes each one with the given parameters, and merges its result into data.
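When the functions take different arguments, each one can be bound to its arguments up front so that the loop only sees zero-argument callables. This is a minimal sketch under stated assumptions: get_data1 and get_data2, their parameters, and the values passed to them are hypothetical stand-ins for the real sources, and functools.partial is one possible way to do the binding:

```python
import functools

import pandas as pd
import tqdm

data = pd.DataFrame({"id": [1, 2, 3], "A": [10, 20, 30]})


# Hypothetical fetchers with different signatures (stand-ins for real sources).
def get_data1(path):
    return pd.DataFrame({"id": [1, 2], "C": [0.1, 0.2]})


def get_data2(query, limit):
    return pd.DataFrame({"id": [2, 3], "D": ["x", "y"]})


# Bind each function to its own arguments; nothing is fetched yet.
fetchers = [
    functools.partial(get_data1, "source1.csv"),
    functools.partial(get_data2, "SELECT ...", limit=2),
]

# One bar step per source: fetch, then merge into the running result.
for fetch in tqdm.tqdm(fetchers):
    data = data.merge(fetch(), on="id", how="left")
```

Because the list has a known length, tqdm can show the total without any extra argument.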

robinood
  • Actually, each get_data function is different. They are supposed to pull the data from different sources (e.g., different DBs). Sorry, I simplified the task too much; I'll edit the question. – Matt07 Apr 11 '22 at 08:48
  • Would it be possible to have only one function that takes the filename (or DB access) as an argument, for example? I mean, is the accessed source the only part that differs between the functions? – robinood Apr 11 '22 at 09:06
  • No, it is not. They fetch from different sources in different ways (e.g., with different queries and different post-filters). – Matt07 Apr 11 '22 at 09:26
  • I guess then the only solution I see is to create an iterable with the functions, as you stated in your question. Note that you do not need lambdas. See the edit I made to the answer! – robinood Apr 11 '22 at 11:33