I am trying to perform a cross join of two pandas DataFrames with dimensions 3383*192 and 5587*1487, and I get a MemoryError. Can anyone guide me on how to perform the cross join and write the complete output to a .csv file in Python (either with batch processing or by using the whole datasets at once)?
Viewed 1,418 times
2
- Try one of [these](https://stackoverflow.com/questions/13269890/cartesian-product-in-pandas) methods. – Erfan Dec 12 '19 at 10:11
- Please provide a minimal example with some code. – ma3oun Dec 12 '19 at 10:12
- Maybe this can help. https://stackoverflow.com/questions/37756991/best-way-to-join-two-large-datasets-in-pandas You are trying to join two very big dataframes; Python probably would not be able to handle it. – Jason Chia Dec 12 '19 at 10:14
- Try an inner join with a constant scalar value in both dfs. You can try dask in case the issue still persists. – Mohamed Thasin ah Dec 12 '19 at 10:14
2 Answers
3
try this,
import pandas as pd
import numpy as np
import dask.dataframe as dd

# two frames with the dimensions from the question, filled with random integers
df1 = pd.DataFrame(np.random.randint(0, 100, size=(3383, 192)))
df2 = pd.DataFrame(np.random.randint(0, 100, size=(5587, 1487)))

# constant key column in both frames, so an inner merge on it yields the cross join
df1['key'] = 0
df2['key'] = 0

# convert to dask dataframes so the merge is processed partition by partition
sd1 = dd.from_pandas(df1, npartitions=3)
sd2 = dd.from_pandas(df2, npartitions=3)

result = dd.merge(sd1, sd2, on='key').drop('key', axis=1)
It works on my machine (8 GB RAM, Ubuntu).
Explanation:
- convert the pandas DataFrames to dask DataFrames
- assign a new column called key with a constant value in both dfs
- perform the merge operation
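Since the question asks for the result as a .csv file, here is a minimal sketch building on the result dask dataframe above (the output filename pattern is just an example) that writes the cross join out partition by partition, so the full result never has to fit in memory at once:

# one CSV file per partition; the '*' in the name is replaced by the partition number
result.to_csv('cross_join-*.csv', index=False)
# or, at the cost of one long sequential write, a single combined file:
# result.to_csv('cross_join.csv', index=False, single_file=True)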

Mohamed Thasin ah
- Thanks for your help. I am able to see its number of columns but not its number of rows. Can you help me with how to see the number of rows? – Arjun Puri Dec 12 '19 at 11:24
- @MohamedThasinah I am trying to do the same thing with a dataframe of size (40000*50) on a machine with 32 GB RAM, and I get an error. Please refer to my question here: https://stackoverflow.com/questions/62839389/dask-dataframe-effecient-row-pair-generator – Saurabh Thakur Jul 13 '20 at 15:06
- I tried using dask dataframes, and whatever pandas was taking too long to do was fairly well parallelized. It's worth checking out. – Moltres Dec 21 '22 at 05:08
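Regarding the row-count question in the comments, a minimal sketch (assuming the merged dask dataframe result from the answer above): dask knows the columns up front but only materialises the row count when it is computed.

n_cols = len(result.columns)         # known immediately
n_rows = result.shape[0].compute()   # lazy; len(result) also works and triggers a compute
print(n_rows, n_cols)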
0
Downcast the columns where possible prior to joining to reduce the data volume, e.g.
df['something'] = pd.to_numeric(df['something'], downcast='integer')  # or 'float', 'signed', 'unsigned'
df['some_category'] = df['some_category'].astype('category')
df['some_time_column'] = pd.to_datetime(df['some_time_column'])
In my applications the reduction can amount to 30-60% of the initial volume, so the probability of hitting the memory ceiling is much lower.
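As an illustration, a minimal sketch (with made-up column names and data) of measuring how much such a downcast actually saves:

import pandas as pd
import numpy as np

# small example frame with an int64 column and a repetitive string column
df = pd.DataFrame({'something': np.arange(100_000, dtype='int64'),
                   'some_category': ['a', 'b', 'c', 'd'] * 25_000})
before = df.memory_usage(deep=True).sum()
df['something'] = pd.to_numeric(df['something'], downcast='integer')
df['some_category'] = df['some_category'].astype('category')
after = df.memory_usage(deep=True).sum()
print(f'{before:,} -> {after:,} bytes ({1 - after / before:.0%} saved)')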

Oleg O