
I am trying to merge 2 dataframes, and for some reason the merge blows up in memory. The two dataframes are relatively large but nothing out of the ordinary, and I am using a strong machine with 128 GB of RAM.

x = b.merge(a, left_on='id1', right_on='id', how='left')

This is the output:

MemoryError: Unable to allocate 1.58 TiB for an array with shape (216639452968,) and data type int64

I am probably doing something wrong here, but I can't understand why it gets to a 1.6 TB memory requirement.

Here is some info on the dataframes:

print(a.info())
print(a.memory_usage(deep=True))
print(b.info())
print(b.memory_usage(deep=True))

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10092079 entries, 0 to 10092078
Data columns (total 2 columns):
 #   Column  Dtype         
---  ------  -----         
 0   id      object        
 1   date    datetime64[ns]
dtypes: datetime64[ns](1), object(1)
memory usage: 154.0+ MB
None

Index          128
id       654665935
date      80736632
dtype: int64

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15000000 entries, 0 to 14999999
Data columns (total 2 columns):
 #   Column  Dtype 
---  ------  ----- 
 0   id1     object
 1   id2     object
dtypes: object(2)
memory usage: 228.9+ MB
None

Index          128
id1      965676606
id2      718661312
dtype: int64

Gal
  • Can you provide the shape and datatypes of your dataframes? – Bigga Jun 01 '23 at 05:57
  • Because you have many matches between `id` and `id1` – Corralien Jun 01 '23 at 07:25
  • @Bigga, this can be seen from the second output I provided. a has ~10M rows and 2 columns, string and datetime. b has 15M rows with two columns, str and str – Gal Jun 01 '23 at 14:31
  • @Corralien, why would it create a 1.6 TB df? In the worst case there will be another column in dataframe b with dates. – Gal Jun 01 '23 at 14:33
  • In the worst case, a join will result in N*M rows. Your N is 10^7 and your M is 1.5 * 10^7. The resulting df can have 1.5 * 10^14 rows. If each row used a single byte, the entire df would take 150 TB (1 TB = 10^12 bytes). – CarlosGDCJ Jun 01 '23 at 16:53
  • @CarlosGDCJ Thanks. So is pandas not trying to do the merge because of the worst possible case? I know for a fact that this will not be an N*M. The number of rows will stay. – Gal Jun 01 '23 at 19:48
  • Well, Pandas is doing its best (it is still trying to merge), but apparently the merged df has 216639452968 rows, which is not the worst case but is bad enough. I don't think you can do this merge in pandas without changing the 2 dataframes (dropping duplicates or something), but [maybe you could try it in `dask`](https://stackoverflow.com/questions/37756991/best-way-to-join-two-large-datasets-in-pandas). See also the size estimate sketched below. – CarlosGDCJ Jun 01 '23 at 22:54
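As a quick sanity check of the N*M arithmetic above, you can estimate the merged row count from the key multiplicities before running the merge (a minimal sketch, assuming the `a` and `b` dataframes from the question):

counts = a['id'].value_counts()             # rows per key in a
est = b['id1'].map(counts).fillna(1).sum()  # a left join keeps each unmatched row of b as 1 row
print(f"estimated merged rows: {est:,.0f}")

If this prints something near 216639452968, the blow-up is coming from duplicate keys, not from the merge itself.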

1 Answer


A likely cause is that there are multiple rows with the same id in each dataframe (a and b). If id "123" appears 10 times in a and 5 times in b, the merged dataframe will have 50 rows with id "123": every matching row of b pairs with every matching row of a.
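For illustration, here is a toy example (hypothetical data) showing the multiplication:

import pandas as pd

# toy stand-ins for the question's frames: id '123' appears 10 times in one, 5 times in the other
small_a = pd.DataFrame({'id': ['123'] * 10, 'date': pd.Timestamp('2023-06-01')})
small_b = pd.DataFrame({'id1': ['123'] * 5, 'id2': ['x'] * 5})

merged = small_b.merge(small_a, left_on='id1', right_on='id', how='left')
print(len(merged))  # 50 -> each of the 5 left rows pairs with all 10 matching right rows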

Make sure you don't have duplicate ids in the datasets. If you do, make sure you really need the duplicates (or use groupby('id').agg(...) to collapse them in whatever way suits your data).
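A minimal sketch of deduplicating before the merge (this assumes you want one row per id in a, keeping the latest date; adapt the aggregation to your needs):

# keep one row per id in a, e.g. the most recent date
a_dedup = a.sort_values('date').drop_duplicates('id', keep='last')

# equivalently, aggregate explicitly
a_dedup = a.groupby('id', as_index=False).agg(date=('date', 'max'))

x = b.merge(a_dedup, left_on='id1', right_on='id', how='left')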

If this doesn't solve it, please add some more information about the number of unique values in the id column of each dataset.
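For example (assuming the dataframes from the question), comparing unique key counts to row counts will show how heavy the duplication is:

print(a['id'].nunique(), len(a))
print(b['id1'].nunique(), len(b))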

Matan Bendak