My data consists of multiple columns and looks something like this:
I would like to group the data in each column separately and count the number of occurrences of each element, which I can achieve like this:
df.groupBy("Col-1").count()
df.groupBy("Col-2").count()
df.groupBy("Col-n").count()
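Equivalently, I could loop over the columns instead of writing each call out, but as far as I understand this still launches one Spark job per column:

for c in df.columns:
    # One groupBy/count per column; each call is its own job.
    df.groupBy(c).count().show()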
However, if there are thousands of columns, this may be time-consuming, so I was trying to find another way to do it.
Here is what I have done so far:
def mapFxn1(x):
    # Pair every value in the row with a count of 1.
    vals = [1] * len(x)
    c = tuple(zip(list(x), vals))
    return c

df_map = df.rdd.map(lambda x: mapFxn1(x))
mapFxn1 takes each row and transforms it into a tuple of tuples, so the first row would look like this: ((10, 1), (2, 1), (x, 1))
I am just wondering how one can use reduceByKey on df_map with lambda x, y: x + y in order to group on each of the columns and count the occurrences of the elements in each column in a single step.
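My suspicion is that reduceByKey needs an RDD of plain (key, value) pairs rather than one tuple of tuples per row, and that the key has to carry the column name so that identical values appearing in different columns are not merged together. Something like the following sketch is what I have in mind (the composite (column, value) key is my own assumption):

from operator import add

# Capture the column names on the driver so the lambda does not
# reference the DataFrame itself (which is not serializable).
cols = df.columns

counts = (df.rdd
          .flatMap(lambda row: [((c, row[c]), 1) for c in cols])
          .reduceByKey(add))

# counts should be an RDD of ((column, value), count) tuples.
counts.take(5)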
Thank you in advance