I have two DataFrames: A (35 million records) and B (30,000 records).
A

| text |
|------|
| pqr  |
| xyz  |
B

| Title |
|-------|
| a     |
| b     |
| c     |
DataFrame C below is obtained after a cross join between A and B:
c = A.crossJoin(B)
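For reference, a minimal reproducible sketch of how C is built (the SparkSession and the toy rows here are just stand-ins for my real data):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Toy stand-ins for the real A (35M rows) and B (30K rows).
A = spark.createDataFrame([("pqr",), ("xyz",)], ["text"])
B = spark.createDataFrame([("a",), ("b",), ("c",)], ["Title"])

# crossJoin takes no join condition; it pairs every row of A with every row of B.
c = A.crossJoin(B)
c.show()
```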
C

| text | Title |
|------|-------|
| pqr  | a     |
| pqr  | b     |
| pqr  | c     |
| xyz  | a     |
| xyz  | b     |
| xyz  | c     |
Both columns above are of type string.
I am performing the operation below, and it results in a Spark error (Job aborted due to stage failure):
display(c.withColumn("Contains", when(col('text').contains(col('Title')), 1).otherwise(0)).filter(col('Contains') == 0).distinct())
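For completeness, the same operation written out with its imports (`display` is the Databricks notebook helper; `show()` works as a stand-in elsewhere):

```python
from pyspark.sql.functions import col, when

# Flag rows where text contains Title, then keep only the non-matching pairs.
flagged = (
    c.withColumn("Contains", when(col("text").contains(col("Title")), 1).otherwise(0))
     .filter(col("Contains") == 0)
     .distinct()
)
flagged.show()  # display(flagged) in Databricks
```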
Any suggestions on how this join should be done so that the resulting operations avoid the Spark error?