
I have tried the following in Spark (Scala).

Logic:

If Code in Data1 equals Code in Data2, the record has to be written to the output file. Based on this condition, 4 * 2 = 8 rows are written.

Is there any way to optimize the piece of code below to group the data?

Data2
  .join(Data1, Seq("Code"), "inner")   // join on the shared Code column; Seq("Code") avoids the ambiguous col("Code") === col("Code") condition
  .selectExpr("Id",
              "Date",
              "Code")
  .as[OutData]
Data1
+----+-----------+
|Id  |Code       |
+----+-----------+
|0839|06869242986|
|4395|06869242986|
|3796|06869242986|
|3592|06869242986|
+----+-----------+

Data2
+------+-----------+
|Date  |Code       |
+------+-----------+
|202050|06869242986|
|202051|06869242986|
+------+-----------+

OutData
+----+------+-----------+
|Id  |Date  |Code       |
+----+------+-----------+
|0839|202050|06869242986|
|4395|202050|06869242986|
|3796|202050|06869242986|
|3592|202050|06869242986|
|0839|202051|06869242986|
|4395|202051|06869242986|
|3796|202051|06869242986|
|3592|202051|06869242986|
+----+------+-----------+
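
For reference, a minimal, self-contained sketch that reproduces the output above from the sample data. The OutData case class definition and the local SparkSession setup are assumptions for illustration; they are not part of the original code.

import org.apache.spark.sql.SparkSession

object JoinExample {
  // Assumed typed output; the real OutData definition is not shown in the question.
  case class OutData(Id: String, Date: String, Code: String)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("join-example").master("local[*]").getOrCreate()
    import spark.implicits._

    val Data1 = Seq(("0839", "06869242986"), ("4395", "06869242986"),
                    ("3796", "06869242986"), ("3592", "06869242986")).toDF("Id", "Code")
    val Data2 = Seq(("202050", "06869242986"), ("202051", "06869242986")).toDF("Date", "Code")

    // Inner join on the shared Code column: every matching Id pairs with every matching Date,
    // so 4 rows * 2 rows = 8 rows come out.
    val out = Data2.join(Data1, Seq("Code"), "inner")
      .selectExpr("Id", "Date", "Code")
      .as[OutData]

    out.show(false)
  }
}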
  • how can we use groupBy on Code along with the join? – Rebe Dec 30 '20 at 15:54
  • Have you tried the solution in this post? https://stackoverflow.com/a/33327239/7206701 – Hoori M. Jan 02 '21 at 05:23
  • You could use a broadcast join on the DataFrame that has fewer records; that would eliminate the shuffle during the join operation. Also make sure that the DataFrame with fewer records is not so big that it would crash your driver node. – Nikunj Kakadiya Jan 04 '21 at 06:08
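
Following up on the broadcast-join comment above, a minimal sketch of what that could look like for this join. The broadcast hint is standard Spark SQL; broadcasting Data2 is an assumption based on it being the smaller table in the sample.

import org.apache.spark.sql.functions.broadcast

// Broadcasting the smaller DataFrame (Data2 here) ships a copy to every executor,
// so the larger side (Data1) is not shuffled for the join. Only use this when the
// broadcast side comfortably fits in driver and executor memory.
val joined = Data1.join(broadcast(Data2), Seq("Code"), "inner")
  .selectExpr("Id", "Date", "Code")
  .as[OutData]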

0 Answers