I am trying to perform a cross join of two pandas DataFrames with dimensions 3383*192 and 5587*1487, and I get a MemoryError. Can anyone guide me on how to perform the cross join and write the complete output to a .csv file in Python (either with batch processing or by using the whole datasets at once)?
Viewed 1,418 times
2
- Try one of [these](https://stackoverflow.com/questions/13269890/cartesian-product-in-pandas) methods. – Erfan Dec 12 '19 at 10:11
- Please provide a minimal example with some code. – ma3oun Dec 12 '19 at 10:12
- Maybe this can help. https://stackoverflow.com/questions/37756991/best-way-to-join-two-large-datasets-in-pandas You are trying to join two very big dataframes; Python probably would not be able to handle it. – Jason Chia Dec 12 '19 at 10:14
- Try an inner join with a constant scalar value in both dfs. You can try dask in case the issue still persists. – Mohamed Thasin ah Dec 12 '19 at 10:14
2 Answers
3
try this,
import pandas as pd
import numpy as np
import dask.dataframe as dd

# two frames with the dimensions from the question, filled with random integers
df1 = pd.DataFrame(np.random.randint(0, 100, size=(3383, 192)))
df2 = pd.DataFrame(np.random.randint(0, 100, size=(5587, 1487)))

# constant key column in both frames, so an inner merge on it yields the cross join
df1['key'] = 0
df2['key'] = 0

# convert to dask dataframes so the merge is processed partition by partition
sd1 = dd.from_pandas(df1, npartitions=3)
sd2 = dd.from_pandas(df2, npartitions=3)

result = dd.merge(sd1, sd2, on='key').drop('key', axis=1)
It works on my machine (8 GB RAM, Ubuntu).
Explanation:
- convert the pandas DataFrames to dask DataFrames
- assign a new column called key with a constant value in both dfs
- perform the merge operation
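Since the question asks for the result as a .csv file, here is a minimal sketch building on the result dask dataframe above (the output filename pattern is just an example) that writes the cross join out partition by partition, so the full result never has to fit in memory at once:

# one CSV file per partition; the '*' in the name is replaced by the partition number
result.to_csv('cross_join-*.csv', index=False)
# or, at the cost of one long sequential write, a single combined file:
# result.to_csv('cross_join.csv', index=False, single_file=True)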

Mohamed Thasin ah
- Thanks for your help. I am able to see its number of columns but not its number of rows. Can you help me with how to see the number of rows? – Arjun Puri Dec 12 '19 at 11:24
- @MohamedThasinah I am trying to do the same thing with a dataframe of size (40000*50) on a machine with 32 GB RAM, and I get an error. Please refer to my question here: https://stackoverflow.com/questions/62839389/dask-dataframe-effecient-row-pair-generator – Saurabh Thakur Jul 13 '20 at 15:06
- I tried using dask dataframes, and whatever pandas was taking too long to do was fairly well parallelized. It's worth checking out. – Moltres Dec 21 '22 at 05:08
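Regarding the row-count question in the comments, a minimal sketch (assuming the merged dask dataframe result from the answer above): dask knows the columns up front but only materialises the row count when it is computed.

n_cols = len(result.columns)         # known immediately
n_rows = result.shape[0].compute()   # lazy; len(result) also works and triggers a compute
print(n_rows, n_cols)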
0
Downcast the columns where possible prior to joining to reduce the data volume, e.g.
df['something'] = pd.to_numeric(df['something'], downcast='integer')  # or 'float', 'signed', 'unsigned'
df['some_category'] = df['some_category'].astype('category')
df['some_time_column'] = pd.to_datetime(df['some_time_column'])
In my applications the reduction can amount to 30-60% of the initial volume, so the probability of hitting the memory ceiling is much lower.
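As an illustration, a minimal sketch (with made-up column names and data) of measuring how much such a downcast actually saves:

import pandas as pd
import numpy as np

# small example frame with an int64 column and a repetitive string column
df = pd.DataFrame({'something': np.arange(100_000, dtype='int64'),
                   'some_category': ['a', 'b', 'c', 'd'] * 25_000})
before = df.memory_usage(deep=True).sum()
df['something'] = pd.to_numeric(df['something'], downcast='integer')
df['some_category'] = df['some_category'].astype('category')
after = df.memory_usage(deep=True).sum()
print(f'{before:,} -> {after:,} bytes ({1 - after / before:.0%} saved)')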

Oleg O