
I am running a large pandas merge (join) operation in a Jupyter notebook on a SageMaker notebook instance (ml.t3.large, i.e. 8 GB of memory).

    import pandas as pd

    df1 = pd.DataFrame({
                        'ID': [1, 2, 3],
                        'Name': ['A', 'B', 'C'],
                        ....
                       })

    df1.shape
    (3000000, 10)

    df2 = pd.DataFrame({
                        'ID': [],
                        'Name': [],
                        ....
                       })

    df2.shape
    (50000, 12)

    # Join data
    df_merge = pd.merge(
                         df1,
                         df2,
                         left_on=['ID', 'Name'],
                         right_on=['ID', 'Name'],
                         how='left'
                       )

When I run this operation, the kernel dies within a minute or so. How can I optimize this operation for memory efficiency?

The dtypes are int64, object, and float64.

Running `df1.info(memory_usage="deep")` shows:

    dtypes: float64(1), int64(6), object(12)
    memory usage: 3.1 GB

kms

1 Answer


For one, combine ID and Name into a single ID column, so that Python/pandas doesn't have to bother with concatenating the two on every match. Then, index + join is much faster than merge, as shown in this answer.
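A minimal sketch of that idea, assuming the frames are already loaded as `df1` and `df2` with the `ID` and `Name` columns from the question (the combined `key` column and the `'|'` separator are just illustrations):

# Sketch: build one combined join key, index the smaller frame on it,
# then use join() instead of merge()
df1['key'] = df1['ID'].astype(str) + '|' + df1['Name']
df2['key'] = df2['ID'].astype(str) + '|' + df2['Name']

# Drop the now-redundant columns from df2 and set the combined key as its index
df2_keyed = df2.drop(columns=['ID', 'Name']).set_index('key')

# Left-join df1's 'key' column against df2_keyed's index; reassigning to df1
# avoids keeping a separate merged copy alive alongside the original
df1 = df1.join(df2_keyed, on='key', how='left')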

Secondly, if possible, using native Python operations tends to be much faster. For example, accessing a key of a Python dictionary (`a['key1']`) is O(1), not O(n).

So if you have two dictionaries like this:

from random import randint

# ~3,000,000 "rows" of 10 values each, keyed by id (mirrors df1)
a = {}
for id in range(3000000):
    a[id] = ['abc' for i in range(10)]

# ~100,000 "rows" of 12 values each, keyed by a random id (mirrors df2)
b = {}
for i in range(100000):
    id = randint(1, 3000000)
    b[id] = ['def' for j in range(12)]

Left-joining like this should be a breeze. It took about 3 seconds on my machine (16 GB of memory).

# Result set: left join of a and b; ids missing from b get a row of Nones
c = {}
null_value = [None for i in range(12)]
for key in a.keys():
    c[key] = a[key] + b.get(key, null_value)
Register Sole
  • I have 2 pandas DataFrames. Are you suggesting going from `DataFrames` to `dict`, doing the `left-join`, and then back to `dfs`? – kms Feb 03 '23 at 18:00
  • That decision very much depends on your requirements. I'm just giving alternatives. For example, I sometimes see people who always resort to data frames when all they wanted to do is a join. In that case, pandas would be overkill. But if the data is already in data frames for one reason or another, try the index and join method. Indexing is similar in concept to a dictionary's key. – Register Sole Feb 04 '23 at 00:03
  • The data is in a DataFrame. I tried the concat, index, and join method and am still running out of memory. – kms Feb 04 '23 at 04:17
  • I just noticed `df1` itself took up 3 GB. Is this expected? If the data is really 3 GB, then the merge result is also at least 3 GB, so you have consumed at least 6 GB in total. In that case, it may be better to assign the result back with `df1 = ...` rather than creating a new variable `df_merge = ...`. If this doesn't work, you may need a bigger instance type. – Register Sole Feb 05 '23 at 17:25
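A minimal sketch of the reassignment idea from the last comment, using the column names from the question; the `del df2` / `gc.collect()` lines are an extra assumption to release memory sooner, not something stated in the thread:

import gc

# Reassign to df1 instead of creating df_merge, so a second multi-GB frame
# is not kept alive under a separate name after the merge
df1 = pd.merge(df1, df2, on=['ID', 'Name'], how='left')

# Optionally drop the smaller frame and trigger garbage collection
del df2
gc.collect()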