I have two large datasets, one 2.6 GB and the other 1 GB. I have managed to read both into DataFrames.
Next, I want to create a new DataFrame by matching the two datasets on a unique ID that both contain, discarding any rows whose ID does not appear in both datasets.
I have tried the merge on a small number of rows and I think it works, but I want to merge the full datasets and show a progress bar while it runs. I am using Jupyter Notebook with Python 3.
The unique ID column shared by both datasets is matrikkel2019. I want to keep the columns from both datasets, but only the rows whose matrikkel2019 value appears in both.
Code
from tqdm import tqdm_notebook
tqdm_notebook().pandas()
merge = energydata.merge(dwellingData, left_on = "matrikkel2019", right_on="matrikkel2019").progress_apply()
I have tried passing lambda x: x**2 to progress_apply, but I get the error "TypeError: unsupported operand type(s) for ** or pow(): 'list' and 'int'" as well as an "Invalid arguments" error.
The main problem is that the merge takes too long and my PC, which has 8 GB of RAM, is struggling, so I don't know how long it will take or whether it will finish at all.
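
To make the goal concrete, here is a minimal sketch of one possible approach, not a tested solution for the full data: split one DataFrame into row slices, inner-merge each slice with the other DataFrame, and wrap the loop in tqdm so a progress bar advances once per slice. The helper name merge_in_chunks, the chunk_size parameter, and the small demo frames are hypothetical and only for illustration. The sketch imports tqdm.auto rather than tqdm_notebook and falls back to no progress bar if tqdm is not installed. Note that the final merged result still has to fit in memory; chunking only bounds the size of each intermediate merge.

```python
import pandas as pd

try:
    from tqdm.auto import tqdm  # picks a notebook or console bar automatically
except ImportError:
    # Fallback so the sketch still runs (without a bar) if tqdm is missing
    def tqdm(iterable, **kwargs):
        return iterable


def merge_in_chunks(left, right, key, chunk_size=100_000):
    """Inner-merge `left` with `right` on `key`, one slice of `left` at a time.

    Hypothetical helper: the name and the default chunk_size are assumptions.
    """
    parts = []
    for start in tqdm(range(0, len(left), chunk_size), desc="merging"):
        chunk = left.iloc[start:start + chunk_size]
        # how="inner" keeps only rows whose key appears in both frames
        parts.append(chunk.merge(right, on=key, how="inner"))
    if not parts:
        # `left` was empty: a plain merge returns the correctly shaped empty frame
        return left.merge(right, on=key, how="inner")
    return pd.concat(parts, ignore_index=True)


# Tiny made-up frames standing in for energydata and dwellingData
energy_demo = pd.DataFrame({"matrikkel2019": [1, 2, 3, 4], "kwh": [10, 20, 30, 40]})
dwelling_demo = pd.DataFrame({"matrikkel2019": [2, 4, 5], "rooms": [3, 5, 2]})

merged_demo = merge_in_chunks(energy_demo, dwelling_demo, "matrikkel2019", chunk_size=2)
print(merged_demo)
```

On the real data the call would presumably be merge_in_chunks(energydata, dwellingData, "matrikkel2019"), with chunk_size tuned to the available RAM.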