-1

I have two dataframes: A (35 million records) and B (30,000 records).

A

|Text |
-------
| pqr  |
-------
| xyz  |
------- 

B

|Title |
-------
| a  |
-------
| b  |
-------
| c  |
------- 

Dataframe C below is obtained from a cross join between A and B.

c = A.crossJoin(B)  # crossJoin() takes no join condition

C

|text | Title |
---------------
| pqr  | a    |
---------------
| pqr  | b    |
---------------
| pqr  | c    |
---------------
| xyz  | a    |
---------------
| xyz  | b    |
---------------
| xyz  | c    |
---------------

Both columns above are of type String.

I am performing the operation below, and it results in a Spark error (Job aborted due to stage failure):

from pyspark.sql.functions import col, when
display(c.withColumn("Contains", when(col('text').contains(col('Title')), 1).otherwise(0)).filter(col('Contains') == 0).distinct())

Any suggestions on how this join should be done to avoid the Spark error in the subsequent operations?

Spark error message

gk1990
  • Are you actually doing a cross join here? As in no join criteria at all? – Andrew May 29 '20 at 19:15
  • Looking at your edited question, if you are specifying join columns, you do not want a cross join. I'd suggest you test this with a **much** smaller amount of data. If Spark is doing a full cross join on those datasets, you will end up with, if my math is correct, over 1 trillion rows. – Andrew May 29 '20 at 19:34
  • Can you paste the Spark error in the question as well? – QuickSilver May 30 '20 at 14:57
  • I want to see whether any value in dataframe B's Title column occurs as a substring of dataframe A's text column. That's the reason I am doing a cross join. The volume is very large. I will try a sample and see if it helps. Thanks – gk1990 Jun 01 '20 at 14:30
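
For reference, the row count estimated in the comments above can be checked with a line of arithmetic. This is only a sketch using the sizes stated in the question; the variable names are made up for illustration.

```python
# Sizes as stated in the question
rows_a = 35_000_000   # dataframe A
rows_b = 30_000       # dataframe B

# A full cross join produces one output row per (A row, B row) pair
cross_join_rows = rows_a * rows_b
print(f"{cross_join_rows:,}")  # 1,050,000,000,000
```

That is a little over one trillion rows, which matches the "over 1 trillion" estimate.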

2 Answers

0

Try using a broadcast join:

from pyspark.sql.functions import broadcast, col
# Broadcast the small dataframe (B, 30,000 records), not the 35-million-record A
c = A.crossJoin(broadcast(B))

If you don't need the extra "Contains" column, you can filter directly:

display(c.filter(col("text").contains(col("Title"))).distinct())
n1tk
QuickSilver
0

Repartition both dataframes before the cross join, and it will work.

  • Please can you [edit] to provide a small code example, so that others can confirm that your answer is correct. – Matthieu H May 02 '23 at 12:15