Merging two large dataframes on specific column with progress bar showing

Question

I have two large datasets, one 2.6 GB and the other 1 GB. I have managed to read both in as DataFrames.

Next I want to build a new DataFrame by matching the two datasets on a unique ID that appears in both, discarding the rows whose IDs don't match between the two datasets.

I have tried the merge on a small number of rows and I think it works, but I want to merge the whole thing and also show a progress bar. I am using Jupyter Notebook with Python 3.

The matrikkel2019 column is the unique ID shared by both datasets. I want to keep the columns from both datasets, but only the rows whose matrikkel2019 ID appears in both.

Code

from tqdm import tqdm_notebook

tqdm_notebook().pandas() 

merge = energydata.merge(dwellingData, left_on = "matrikkel2019", right_on="matrikkel2019").progress_apply()

I have tried passing lambda x: x**2 to progress_apply, but I get the error TypeError: unsupported operand type(s) for ** or pow(): 'list' and 'int', as well as an Invalid arguments error.

The main problem is that the merge operation takes too long and my PC with 8 GB RAM is struggling, so I don't know how long it will take or whether it will finish.

@prp It works for some pandas operations, but the documentation doesn't clarify which ones: https://pypi.org/project/tqdm/. So I hope it works for merge, but my main problem is that the merge operation takes too long and my PC is struggling, so I don't know how long it will take. — DannyTG, Feb 21 '20 at 15:41

score 1 · Accepted Answer · answered Feb 21 '20 at 16:02

tqdm does support progress bars for pandas merging operations.

Code taken from this question:

import numpy as np  # needed for np.random.randint below
import pandas as pd
from tqdm import tqdm

df1 = pd.DataFrame({'lkey': 1000*['a', 'b', 'c', 'd'],'lvalue': np.random.randint(0,int(1e8),4000)})
df2 = pd.DataFrame({'rkey': 1000*['a', 'b', 'c', 'd'],'rvalue': np.random.randint(0, int(1e8),4000)})

#this is how you activate the pandas features in tqdm
tqdm.pandas()
#call the progress_apply feature with a dummy lambda 
df1.merge(df2, left_on='lkey', right_on='rkey').progress_apply(lambda x: x)

For your code, along with the imports, it should just be:

tqdm.pandas()
merge = energydata.merge(dwellingData, left_on = "matrikkel2019", right_on="matrikkel2019").progress_apply(lambda x: x)
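Note that in the pattern above the merge itself has already finished by the time progress_apply starts, so the bar tracks the dummy apply rather than the merge. If you want the bar to follow the merge work, one option is to merge the left frame in slices and wrap the loop in tqdm. This is only a minimal sketch: merge_in_chunks, chunk_size, and the small demo frames are made-up names for illustration, not part of pandas or tqdm, and it assumes an inner join on a single shared column.

```python
import numpy as np
import pandas as pd
from tqdm import tqdm


def merge_in_chunks(left, right, on, chunk_size=100_000):
    """Inner-merge `left` with `right`, one slice of `left` at a time.

    Hypothetical helper for illustration; not a pandas or tqdm function.
    """
    pieces = []
    # Each slice of `left` is merged against the whole of `right`;
    # the bar advances once per finished slice.
    for start in tqdm(range(0, len(left), chunk_size), desc="merging"):
        chunk = left.iloc[start:start + chunk_size]
        pieces.append(chunk.merge(right, on=on, how="inner"))
    if not pieces:
        # Empty `left`: fall back to a plain merge to get the right columns.
        return left.merge(right, on=on, how="inner")
    return pd.concat(pieces, ignore_index=True)


# Small stand-in frames (assumed data); the real energydata / dwellingData
# share the "matrikkel2019" column as described in the question.
energy_demo = pd.DataFrame(
    {"matrikkel2019": np.arange(10), "kwh": np.arange(10) * 1.5}
)
dwelling_demo = pd.DataFrame(
    {"matrikkel2019": np.arange(5, 15), "area": np.arange(10) + 50}
)

merged_demo = merge_in_chunks(
    energy_demo, dwelling_demo, on="matrikkel2019", chunk_size=3
)
```

With the real data this would be called as merge_in_chunks(energydata, dwellingData, on="matrikkel2019"). Slicing keeps each intermediate merge small, but the concatenated result still has to fit in memory, so it may not be enough on its own with 8 GB of RAM.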

The above is not working for me with the latest pandas and tqdm. What was working is: stackoverflow.com/a/68936833/3921758 — DataScientYst, Aug 26 '21 at 11:47
