First-time Spark user here. I've created RDDs from two CSV files (employees and dept). I'd like to produce output that counts the number of employees per department ID and identifies the two department names with the most employees. "deptno" is the common key in both files, but I don't know how to join them (my untested guess is in the edit at the bottom).
The employee file contains the following columns: [empno, ename, job, mgr, hiredate, sal, comm, deptno]
The dept file contains the following columns: [deptno, dname, location]
Here is what I've done so far:
```
# Load the employee CSV as an RDD of raw lines
employees_rdd = sc.textFile("/FileStore/tables/Employee.csv")
employees_rdd.take(3)

# Drop the header row by filtering out the first line
header_e = employees_rdd.first()
employees1 = employees_rdd.filter(lambda row: row != header_e)
employees1.take(1)
```
```
# Same pattern for the dept CSV: load, then drop the header
dept_rdd = sc.textFile("/FileStore/tables/Dept.csv")
dept_rdd.take(3)
header_d = dept_rdd.first()
dept1 = dept_rdd.filter(lambda row: row != header_d)
dept1.take(1)
```
```
# Split each employee line into columns, then map to (deptno, 1) pairs;
# deptno is the last column (index 7)
employees2 = employees1.map(lambda row: row.split(","))
employees_kv = employees2.map(lambda row: (row[7], 1))
employees_kv.take(3)
```
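If I break the aggregation into its own step, the `reduceByKey` part seems to run fine on its own and should give the employee count per department ID:

```
# Sum the 1s per deptno: yields (deptno, employee_count) pairs
dept_counts = employees_kv.reduceByKey(lambda x, y: x + y)
dept_counts.take(3)
```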
I'm getting a syntax error on the line below:
```
employees_kv.reduceByKey(lambda x,y : x+y).takeOrdered(2, lambda (x,y): -1*y)
```
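From what I've read, Python 3 removed tuple unpacking in lambda parameters (PEP 3113), so I suspect `lambda (x,y):` is the problem and the key function has to index into the pair instead. This version seems to run, but I'm not sure it's the right approach:

```
# Negate the count so takeOrdered's ascending order returns the two
# (deptno, count) pairs with the largest counts
top_two = dept_counts.takeOrdered(2, key=lambda kv: -kv[1])
```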
Any assistance is greatly appreciated.
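Edit: for the join and the top two department names, this is my untested guess based on the pair-RDD docs, keying both sides by deptno (dept column indexes taken from the layout above):

```
# Key the dept rows by deptno so they can be joined: (deptno, dname)
dept_kv = dept1.map(lambda row: row.split(",")) \
               .map(lambda cols: (cols[0], cols[1]))

# Join counts with names: (deptno, (count, dname)), then take the two
# departments with the largest employee counts
top_two_named = dept_counts.join(dept_kv) \
                           .takeOrdered(2, key=lambda kv: -kv[1][0])
```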