How to apply same function on all of the columns of a dataset in parallel using Spark(Java)

Question

I have a dataset with some categorical features. I am trying to apply exact same function on all of these categorical features in Spark framework. My first assumption was that I can parallelize operation of each feature with operation of other features. However I couldn't figure out is it possible or not (confused after reading this, this).

For example, assume that my dataset is as following:

feature1, feature2, feature3
blue,apple,snake
orange,orange,monkey
blue,orange,horse

I want to count the number of occurrences of each category for each feature, separately. For example for feature1 (blue=2, orange=1)

You showed the input dataset. What about the output dataset? How would the output look like? — Jacek Laskowski, May 30 '17 at 07:14
I want to find number of each categories in each feature. For example: for feature one output is an array like 2,1. But in here for simplicity I write the categories like red, blue. but in my problem I will change each category to bit representation. for example: in first feature I have 2 categories(Blue and orange). I will use 2 bit to represent it. so red will be 10 and orange will be 01. then I will sum column-wise and output will be 11 which means 1 for blue 1 for orange. accordingly, I cant use normal aggregation like count. I want to use UDF. can you please help me about how to write it? — Nooshin salek faramarzi, May 30 '17 at 18:07

score 1 · Answer 1 · answered May 26 '17 at 04:51

1

TL;DR Spark SQL's DataFrames are not split per column but per rows so Spark processes group of rows per task (not columns) unless you split the source dataset using select-like operator.

If you want to:

count the number of occurrences of each category for each feature, separately

simply use groupBy and count (perhaps with join) or use windows (with window aggregate functions).

answered May 26 '17 at 04:51

Jacek Laskowski

72,696
27
242
420

Thanks a lot for your response. Is there any solution to apply my own function after I used group by? I find a solution in https://spark.apache.org/docs/latest/sql-programming-guide.html (Type-Safe User-Defined Aggregate Functions). However, I am not sure that would it be working for my case and actually I couldn't really understand it. Would you please help me about it? – Nooshin salek faramarzi May 30 '17 at 02:55
Yes. You can use UDAF, but I'd rather stick to native aggregate functions first and only use UDAFs as the last resort. – Jacek Laskowski May 30 '17 at 07:14
I want to find number of each categories in each feature. For example: for feature one output is an array like 2,1. But in here for simplicity I write the categories like red, blue. but in my problem I will change each category to bit representation. for example: in first feature I have 2 categories(Blue and orange). I will use 2 bit to represent it. so red will be 10 and orange will be 01. then I will sum column-wise and output will be 11 which means 1 for blue 1 for orange. accordingly, I cant use normal aggregation like count. I want to use UDF. would you please help me? – Nooshin salek faramarzi May 30 '17 at 18:08

How to apply same function on all of the columns of a dataset in parallel using Spark(Java)

1 Answers1