Pyspark crossjoin between 2 dataframes with millions of records

Question

I have 2 dataframes A(35 Million records) and B(30000 records)

A

|Text |
-------
| pqr  |
-------
| xyz  |
-------

B

|Title |
-------
| a  |
-------
| b  |
-------
| c  |
-------

Below dataframe C is obtained after a crossjoin between A and B.

c = A.crossJoin(B, on = [A.text == B.Title)

C

|text | Title |
---------------
| pqr  | a    |
---------------
| pqr  | b    |
---------------
| pqr  | c    |
---------------
| xyz  | a    |
---------------
| xyz  | b    |
---------------
| xyz  | c    |
---------------

Both the columns above are of type String.

I am performing the below operation and it results in an Spark error(Job aborted due to stage failure)

display(c.withColumn("Contains", when(col('text').contains(col('Title')), 1).otherwise(0)).filter(col('Contains') == 0).distinct())

Any suggestions on how this join needs to be done to avoid the Spark error() on the resulting operations?

Spark error message

Are you actually doing a cross join here? As in no join criteria at all? — Andrew, May 29 '20 at 19:15
Looking at your edited question, if you are specifying join columns, you do not want a cross join. I'd suggest you test this with a **much** smaller amount of data. If Spark is doing a full cross join on those datasets, you will end up with, if my math is correct, over 1 trillion rows. — Andrew, May 29 '20 at 19:34
I am interested to see if any of the records in dataframe B Title column exist as a substring in dataframe A text column. That's the reason I am doing a crossjoin. The volume is very large. I will try to use a sample and see if it helps. thanks — gk1990, Jun 01 '20 at 14:30

score 0 · Answer 1 · edited Mar 14 '22 at 18:22

0

try using broadcast joins

from pyspark.sql.functions import broadcast
c = broadcast(A).crossJoin(B)

If you don't need and extra column "Contains" column thne you can just filter it as

display(c.filter(col("text").contains(col("Title"))).distinct())

edited Mar 14 '22 at 18:22

n1tk

2,406
2
21
35

answered May 29 '20 at 18:49

QuickSilver

3,915
2
13
29

score 0 · Answer 2 · answered May 01 '23 at 11:25

0

Repartition the both dataframes before cross join , it will work.

answered May 01 '23 at 11:25

Aditya Balki

1

1

Please can you [edit] to provide a small code example, so that others can confirm that your answer is correct. – Matthieu H May 02 '23 at 12:15

Pyspark crossjoin between 2 dataframes with millions of records

2 Answers2

Linked