Problem Statement:
I need to perform SQL-style table joins across multiple CSV files, one after another. For example, I have files CSV1, CSV2, CSV3, ..., CSVn.
I need to join (inner/outer/left/full) two CSVs at a time, then join that result with the third CSV, and so on until all the CSVs are merged.
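To make the intended chaining concrete, here is a minimal sketch that folds a list of DataFrames into one by merging them left to right. The small in-memory frames stand in for the CSV files, and the shared column name `id`, the sample values, and the helper `merge_all` are hypothetical, chosen only for illustration.

```python
from functools import reduce

import pandas as pd


def merge_all(frames, on, how="inner"):
    """Merge a list of DataFrames left to right on a shared key column."""
    return reduce(lambda left, right: left.merge(right, on=on, how=how), frames)


# Hypothetical stand-ins for CSV1, CSV2 and CSV3.
df1 = pd.DataFrame({"id": [1, 2, 3], "a": ["x", "y", "z"]})
df2 = pd.DataFrame({"id": [2, 3, 4], "b": [10, 20, 30]})
df3 = pd.DataFrame({"id": [3, 4, 5], "c": [0.1, 0.2, 0.3]})

# Only id 3 is present in all three frames, so the inner join keeps one row.
inner_result = merge_all([df1, df2, df3], on="id", how="inner")

# The outer join keeps every id seen in any frame (1 through 5).
outer_result = merge_all([df1, df2, df3], on="id", how="outer")
```

Each step holds both inputs and the intermediate result in memory at once, which is relevant to the memory behaviour described further down.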
What I have tried:
I am using the pandas merge method (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html) to merge the DataFrames loaded from the CSV files.
Code Snippet:
import pandas as pd
df1 = pd.read_csv(path_of_csv1)
df2 = pd.read_csv(path_of_csv2)
resultant_df = df1.merge(df2, left_on='left_csv_column_name', right_on='right_csv_column_name', how='inner')
.....
I am using pandas version 1.1.0 and Python version 3.8.5.
Problem I am facing:
I am using a MacBook Pro with 8 GB of RAM and have tried merging the DataFrames both inside and outside a Docker container. For smaller CSV files of around 10 MB each, I can merge some files successfully, but for bigger CSV files, say 50 MB each, I run into what looks like a memory leak. Before the merge starts, about 3.5 GB of RAM is available out of the 6 GB allocated to Docker (checked with docker stats <container_name>). Once the merge starts, Docker consumes all of the available RAM and the process is killed partway through with signal 9 (SIGKILL).
I have also tried merging them outside the container. The same memory issue persists, and my process/terminal hangs partway through.
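One thing that can be checked before merging is how many rows the join would produce. An inner merge emits one row for every matching pair of keys, so duplicated key values on both sides multiply the output size. The following is a minimal sketch of that check, not a confirmed diagnosis of the problem above; the helper `estimate_inner_rows`, the column names `k` and `k2`, and the toy data are hypothetical.

```python
import pandas as pd


def estimate_inner_rows(left, right, left_on, right_on):
    """Estimate the row count of an inner merge without performing it.

    Assumption: missing (NaN) keys are ignored here, because value_counts
    drops them by default.
    """
    left_counts = left[left_on].value_counts()
    right_counts = right[right_on].value_counts()
    # Align on key value; a key missing from one side contributes 0 rows.
    pairs_per_key = left_counts.mul(right_counts, fill_value=0)
    return int(pairs_per_key.sum())


# Toy frames: key 1 appears twice on the left and three times on the right.
left = pd.DataFrame({"k": [1, 1, 2], "a": ["p", "q", "r"]})
right = pd.DataFrame({"k2": [1, 1, 1, 3], "b": [10, 20, 30, 40]})

estimated = estimate_inner_rows(left, right, "k", "k2")

# Cross-check against the real merge on the toy data.
actual = len(left.merge(right, left_on="k", right_on="k2", how="inner"))
```

If the estimate for the real files is far larger than either input, the memory growth would come from the size of the merged result rather than from the input files themselves.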
PS: Apologies if I have written anything incorrectly.
Any help would be much appreciated. I am completely stuck on this merge.