I'm learning Spark and I came across a problem that I'm unable to overcome. What I would like to achieve is the number of positions at which two arrays hold the same value. I'm able to get what I want via a Python UDF, but I'm wondering if there is a way using only Spark functions.
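For reference, here is a minimal sketch of the kind of UDF that does the job for me (count_matches is just an illustrative name; it stands in for some_magic in the example below):

from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

def count_matches(a, b):
    # Count the positions where both arrays hold the same value.
    return sum(1 for x, y in zip(a, b) if x == y)

count_matches_udf = udf(count_matches, IntegerType())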
df_bits = sqlContext.createDataFrame(
    [[[0, 1, 1, 0, 1], [1, 1, 1, 0, 0]]],
    ['bits1', 'bits2'])
df_bits_with_result = df_bits.select('bits1', 'bits2', some_magic('bits1', 'bits2'))
df_bits_with_result.show()
+---------------+---------------+------------------------+
|          bits1|          bits2|some_magic(bits1, bits2)|
+---------------+---------------+------------------------+
|[0, 1, 1, 0, 1]|[1, 1, 1, 0, 0]|                       3|
+---------------+---------------+------------------------+
Why 3? The arrays match at indices 1, 2, and 3: bits1[1] == bits2[1] AND bits1[2] == bits2[2] AND bits1[3] == bits2[3].
I tried playing with rdd.reduce, but with no luck.
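The closest working alternative I have is dropping to the RDD and counting matches per row, roughly like the sketch below (matches is an illustrative name), but that abandons the DataFrame API, which is what I'd like to avoid:

# Per-row count via the RDD API: zip the two arrays and count equal pairs.
matches = (df_bits.rdd
           .map(lambda row: sum(1 for x, y in zip(row.bits1, row.bits2)
                                if x == y))
           .collect())
# matches == [3]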