I have a Python program that analyses data, and I want to run it with Spark. I distribute the data among the workers and apply some transformations to it, but in the end I need to collect the results on the master node and run another function on them.
In the driver program I have this code:
from pyspark import SparkContext

sc = SparkContext(conf=spark_conf)
rdd = sc.parallelize(group_list, 4) \
    .map(function1, preservesPartitioning=True) \
    .map(function2, preservesPartitioning=True) \
    .map(function3, preservesPartitioning=True) \
    .map(function4, preservesPartitioning=True) \
    .map(function5, preservesPartitioning=True) \
    .map(function6, preservesPartitioning=True) \
    .map(function7, preservesPartitioning=True) \
    .map(function8, preservesPartitioning=True) \
    .map(function9, preservesPartitioning=True)
The last RDD, produced by function9, is a table with several rows whose key is unique within each worker's partition. When the master node collects the last RDD from all the workers, the combined result contains repeated keys. I have to group this final table by key and aggregate some of its columns, so I have a final function that takes the whole table and performs the group-by and aggregation. But I do not know how to pass the last RDD to this final function.
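To make the intent concrete, here is a rough sketch of the kind of final function I mean (the name final_function, the pandas DataFrame, and the count-weighted average are only illustrations of what I want, not my actual code):

import pandas as pd

def final_function(rows):
    # rows: the (key, count, average) tuples collected from all workers
    df = pd.DataFrame(rows, columns=["key", "count", "average"])
    # recompute the per-key average, weighting each partial average by its count
    df["weighted"] = df["count"] * df["average"]
    grouped = df.groupby("key").agg(count=("count", "sum"),
                                    weighted=("weighted", "sum"))
    grouped["average"] = grouped["weighted"] / grouped["count"]
    return grouped.reset_index()[["key", "count", "average"]]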
For example, on worker1 I have this data:
key  count  average
B    3      0.2
x    2      0.1
y    5      1.2
On worker2, I have this data:
key  count  average
B    2      0.1
c    1      0.01
x    3      0.34
When the master node receives all the data from the workers, it has:
key  count  average
B    3      0.2
x    2      0.1
y    5      1.2
B    2      0.1
c    1      0.01
x    3      0.34
You can see that the data contain the keys B and x twice each. I need to run another function on the master node that groups by the key column and calculates a new value for the average column. I tried reduce and passed my final function to it, but it raises an error because reduce expects a function of two arguments, roughly as in the sketch below. Which Spark action can I use to run my function on the last RDD?
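This is approximately what I tried (a sketch only; rdd is the chained RDD from above and final_function is the single-argument function described earlier):

# RDD.reduce expects a function f(a, b) that combines two elements of the RDD,
# so passing my single-argument final_function fails with something like:
# TypeError: final_function() takes 1 positional argument but 2 were given
result = rdd.reduce(final_function)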
Any guidance would be really appreciated.