Problem Statement: I am trying to write Spark code in Scala which will load the two files mentioned below (1. a file with id and name, 2. a file with id and salary) from HDFS, join them on the id, and produce (name, salary) pairs. The joined data should then be saved as multiple files, grouped by salary (meaning each file will contain the names of all employees with the same salary, and the file name has to include that salary as well).
EmployeeName.csv
E01,Lokesh
E02,Bhupesh
E03,Amit
E04,Ratan
E05,Dinesh
E06,Pavan
E07,Tejas
E08,Sheela
E09,Kumar
E10,Venkat
EmployeeSalary.csv
E01,50000
E02,50000
E03,45000
E04,45000
E05,50000
E06,45000
E07,50000
E08,10000
E09,10000
E10,10000
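For reference, the parsing step this calls for (turning `id,value` lines into key/value pairs for the join) can be checked locally with plain Scala, no Spark needed; the sample lines below are taken from the first file above:

```scala
// Parse "id,value" CSV lines into (id, value) pairs.
// split(",", 2) splits only on the first comma, so a value that itself
// contains a comma would be kept intact.
val lines = List("E01,Lokesh", "E02,Bhupesh", "E03,Amit")

val pairs = lines.map { line =>
  val Array(id, value) = line.split(",", 2)
  (id, value)
}
```

The same function passed to `map` here works unchanged on an RDD of lines from `sc.textFile`.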
I tried the code below, but it does not run. It looks like calling RDD operations from within another RDD operation is not supported. How else can I work this out?
val employeename = sc.textFile("/user/cloudera/EmployeeName").map(x => (x.split(",")(0), x.split(",")(1)))
val employeesalary = sc.textFile("/user/cloudera/EmployeeSalary").map(s => (s.split(",")(0), s.split(",")(1)))
val join = employeename.join(employeesalary).map { case (id, (name, salary)) => (salary, name) }
val group = join.groupByKey()
  .map { case (key, groupvalues) => (key, groupvalues.toList) }
  .sortByKey()
// These two lines are where it breaks: sc cannot be used inside an RDD transformation
val rdd1 = group.map { case (k, v) => k -> sc.parallelize(v) }
rdd1.foreach { case (k, rdd) => rdd.saveAsTextFile("user/cloudera/" + k) }
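One pattern that avoids nesting RDDs is to collect only the distinct salaries to the driver, then run one filter + save per salary from the driver side. This is a sketch, not a tested answer: plain Scala collections stand in for the RDDs (so it runs without Spark), the Spark equivalents are shown in comments, the joined data is inlined assuming all ten ids match, and the output paths/file names are illustrative:

```scala
import java.nio.file.{Files, Paths}
import scala.jdk.CollectionConverters._

// The joined (salary, name) pairs, standing in for the `join` RDD above.
val joined = List(
  ("50000", "Lokesh"), ("50000", "Bhupesh"), ("45000", "Amit"),
  ("45000", "Ratan"), ("50000", "Dinesh"), ("45000", "Pavan"),
  ("50000", "Tejas"), ("10000", "Sheela"), ("10000", "Kumar"),
  ("10000", "Venkat"))

// Bring only the distinct salaries back to the driver (a small set).
// Spark equivalent: val salaries = join.keys.distinct().collect()
val salaries = joined.map(_._1).distinct

// One filter + save per salary, driven entirely from the driver,
// so no RDD is ever created inside another RDD operation.
// Spark equivalent:
//   salaries.foreach { s =>
//     join.filter { case (salary, _) => salary == s }
//         .values
//         .saveAsTextFile("/user/cloudera/salary_" + s)
//   }
salaries.foreach { s =>
  val names = joined.filter(_._1 == s).map(_._2)
  Files.write(Paths.get(s"salary_$s.txt"), names.asJava)
}
```

Each salary ends up in its own output named after that salary, which matches the requirement; the trade-off is one pass over the data per distinct salary, which is fine when the number of distinct salaries is small.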