I am running a large pandas merge (join) operation in a Jupyter notebook on a SageMaker notebook instance of type ml.t3.large, i.e. 8 GB of memory.
import pandas as pd
df1 = pd.DataFrame({
'ID': [1, 2, 3],
'Name': ['A','B','C'],
....
})
df1.shape
(3000000, 10)
df2 = pd.DataFrame({
'ID': [],
'Name': [],
....
})
df2.shape
(50000, 12)
# Join data
df_merge = pd.merge(
df1,
df2,
left_on = ['ID','Name'],
right_on = ['ID','Name'],
how = 'left'
)
When I run this operation, the kernel dies within a minute or so. How can I optimize this operation for memory efficiency?
The dtypes are int64, object, and float64. Running df1.info(memory_usage="deep") shows:

dtypes: float64(1), int64(6), object(12)
memory usage: 3.1 GB
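
One idea I have been considering, but have not confirmed is correct, is shrinking both frames before the merge by converting the object columns to category and downcasting the numeric columns. This is only a rough sketch; the shrink helper below is something I wrote for illustration, not code I have validated:

import pandas as pd

def shrink(df, key_cols=('ID', 'Name')):
    # Illustrative helper, not validated: convert non-key object columns
    # to category and downcast numeric columns to smaller widths.
    out = df.copy()
    for col in out.columns:
        if col in key_cols:
            continue  # leave the join keys untouched to keep the merge simple
        if out[col].dtype == object:
            out[col] = out[col].astype('category')
        elif out[col].dtype == 'int64':
            out[col] = pd.to_numeric(out[col], downcast='integer')
        elif out[col].dtype == 'float64':
            out[col] = pd.to_numeric(out[col], downcast='float')
    return out

df1_small = shrink(df1)
df2_small = shrink(df2)
df1_small.info(memory_usage='deep')  # to see how much this actually saves

df_merge = pd.merge(
    df1_small,
    df2_small,
    left_on=['ID', 'Name'],
    right_on=['ID', 'Name'],
    how='left'
)

Is something along these lines the right direction, or is there a better way to keep this merge within 8 GB?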