
I have tried the following in Spark (Scala).

Logic:

If Code in Data1 equals Code in Data2, the record has to be written to the output file. Based on this condition, 4 * 2 = 8 rows are written.

Is there any way to optimize the piece of code below to group the data?

Data2
  .join(Data1, Seq("Code"), "inner")   // join on the shared Code column; Seq("Code") avoids the ambiguous col("Code") === col("Code") condition
  .selectExpr("Id",
              "Date",
              "Code")
  .as[OutData]
Data1
+----+-----------+
|Id  |Code       |
+----+-----------+
|0839|06869242986|
|4395|06869242986|
|3796|06869242986|
|3592|06869242986|
+----+-----------+

Data2
+------+-----------+
|Date  |Code       |
+------+-----------+
|202050|06869242986|
|202051|06869242986|
+------+-----------+

OutData
+----+------+-----------+
|Id  |Date  |Code       |
+----+------+-----------+
|0839|202050|06869242986|
|4395|202050|06869242986|
|3796|202050|06869242986|
|3592|202050|06869242986|
|0839|202051|06869242986|
|4395|202051|06869242986|
|3796|202051|06869242986|
|3592|202051|06869242986|
+----+------+-----------+
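
For reference, a minimal, self-contained sketch that reproduces the output above from the sample data. The OutData case class definition and the local SparkSession setup are assumptions for illustration; they are not part of the original code.

import org.apache.spark.sql.SparkSession

object JoinExample {
  // Assumed typed output; the real OutData definition is not shown in the question.
  case class OutData(Id: String, Date: String, Code: String)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("join-example").master("local[*]").getOrCreate()
    import spark.implicits._

    val Data1 = Seq(("0839", "06869242986"), ("4395", "06869242986"),
                    ("3796", "06869242986"), ("3592", "06869242986")).toDF("Id", "Code")
    val Data2 = Seq(("202050", "06869242986"), ("202051", "06869242986")).toDF("Date", "Code")

    // Inner join on the shared Code column: every matching Id pairs with every matching Date,
    // so 4 rows * 2 rows = 8 rows come out.
    val out = Data2.join(Data1, Seq("Code"), "inner")
      .selectExpr("Id", "Date", "Code")
      .as[OutData]

    out.show(false)
  }
}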
  • how can we use groupBy on Code along with the join? – Rebe Dec 30 '20 at 15:54
  • Have you tried the solution in this post? https://stackoverflow.com/a/33327239/7206701 – Hoori M. Jan 02 '21 at 05:23
  • You could use a broadcast join on the DataFrame that has fewer records; that would eliminate the shuffle during the join operation. Also make sure that the DataFrame with fewer records is not so big that it would crash your driver node. – Nikunj Kakadiya Jan 04 '21 at 06:08
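
Following up on the broadcast-join comment above, a minimal sketch of what that could look like for this join. The broadcast hint is standard Spark SQL; broadcasting Data2 is an assumption based on it being the smaller table in the sample.

import org.apache.spark.sql.functions.broadcast

// Broadcasting the smaller DataFrame (Data2 here) ships a copy to every executor,
// so the larger side (Data1) is not shuffled for the join. Only use this when the
// broadcast side comfortably fits in driver and executor memory.
val joined = Data1.join(broadcast(Data2), Seq("Code"), "inner")
  .selectExpr("Id", "Date", "Code")
  .as[OutData]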

0 Answers