The data in my first RDD looks like this:
1253
545553
12344896
1 2 1
1 43 2
1 46 1
1 53 2
The first 3 integers are counters that I need to broadcast. After them, every line has the same three-field format, for example:
1 2 1
1 43 2
I want to map all the lines after the 3 counters to a new RDD, after doing some computation on them in a function. But I can't figure out how to separate those first 3 values and map the rest normally.
My Python code looks like this:
documents = sc.textFile("file.txt").map(lambda line: line.split(" "))
final_doc = documents.map(lambda x: (int(x[0]), function1(int(x[1]), int(x[2])))).reduceByKey(lambda x, y: x + " " + y)
This works only when the first 3 values are not in the text file; with them present, it fails with an error.
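A plain-Python sketch of what is probably going wrong, assuming the error is an IndexError: each counter line splits into a single field, so x[1] and x[2] do not exist for it. The names sample_lines, field_counts and safe_second_field are only for illustration and are not part of my real code.

```python
# The sample data from above, one string per line of the text file.
sample_lines = ["1253", "545553", "12344896", "1 2 1", "1 43 2", "1 46 1", "1 53 2"]

# Number of space-separated fields on each line:
# one for every counter line, three for every record line.
field_counts = [len(line.split(" ")) for line in sample_lines]


def safe_second_field(line):
    """Return the second field of a line, or None if the line has no second field."""
    fields = line.split(" ")
    try:
        return fields[1]
    except IndexError:
        # This is the case the counter lines fall into.
        return None
```

If that assumption holds, the map itself is fine for the record lines; it just must never see the three counter lines.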
I don't want to simply discard those first 3 values. I want to store them in 3 broadcast variables and then pass the rest of the dataset to the map function.
And yes, the text file has to stay in that format; I cannot remove those 3 counters.
function1 just does some computation and returns the result.
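To make the goal concrete, here is a minimal sketch of the kind of thing I am after, not a tested solution. It assumes that the counters are always the first 3 lines of the file, that sc is an existing SparkContext, and that function1 returns a string (the body below is a hypothetical stand-in for my real function). parse_record and build_final_doc are names made up for this sketch.

```python
def function1(a, b):
    # Hypothetical placeholder for the real computation. It returns a string
    # so that the " " concatenation in reduceByKey works.
    return str(a + b)


def parse_record(fields):
    """Turn a split record line such as ['1', '43', '2'] into a (key, value) pair."""
    return (int(fields[0]), function1(int(fields[1]), int(fields[2])))


def build_final_doc(sc, path="file.txt", n_counters=3):
    """Broadcast the leading counters and map only the remaining lines."""
    lines = sc.textFile(path)

    # take() returns the first n_counters lines to the driver;
    # each one is wrapped in its own broadcast variable.
    counters = [sc.broadcast(int(value)) for value in lines.take(n_counters)]

    # zipWithIndex() pairs every line with its position: (line, index).
    # Keeping only index >= n_counters drops the counter lines.
    records = (lines.zipWithIndex()
                    .filter(lambda pair: pair[1] >= n_counters)
                    .map(lambda pair: pair[0].split(" ")))

    final_doc = records.map(parse_record).reduceByKey(lambda x, y: x + " " + y)
    return counters, final_doc
```

Inside a task, each counter would then be read with counters[0].value and so on. Another option might be to filter on the number of fields (len(fields) == 3) instead of the line index, but that relies on counter lines never containing a space.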