4

I have a dataset with some categorical features. I am trying to apply exact same function on all of these categorical features in Spark framework. My first assumption was that I can parallelize operation of each feature with operation of other features. However I couldn't figure out is it possible or not (confused after reading this, this).

For example, assume that my dataset is as following:

feature1, feature2, feature3
blue,apple,snake
orange,orange,monkey
blue,orange,horse

I want to count the number of occurrences of each category for each feature, separately. For example for feature1 (blue=2, orange=1)

pdpi
  • 4,163
  • 2
  • 21
  • 30
  • You showed the input dataset. What about the output dataset? How would the output look like? – Jacek Laskowski May 30 '17 at 07:14
  • I want to find number of each categories in each feature. For example: for feature one output is an array like 2,1. But in here for simplicity I write the categories like red, blue. but in my problem I will change each category to bit representation. for example: in first feature I have 2 categories(Blue and orange). I will use 2 bit to represent it. so red will be 10 and orange will be 01. then I will sum column-wise and output will be 11 which means 1 for blue 1 for orange. accordingly, I cant use normal aggregation like count. I want to use UDF. can you please help me about how to write it? – Nooshin salek faramarzi May 30 '17 at 18:07

1 Answers1

1

TL;DR Spark SQL's DataFrames are not split per column but per rows so Spark processes group of rows per task (not columns) unless you split the source dataset using select-like operator.

If you want to:

count the number of occurrences of each category for each feature, separately

simply use groupBy and count (perhaps with join) or use windows (with window aggregate functions).

Jacek Laskowski
  • 72,696
  • 27
  • 242
  • 420
  • Thanks a lot for your response. Is there any solution to apply my own function after I used group by? I find a solution in https://spark.apache.org/docs/latest/sql-programming-guide.html (Type-Safe User-Defined Aggregate Functions). However, I am not sure that would it be working for my case and actually I couldn't really understand it. Would you please help me about it? – Nooshin salek faramarzi May 30 '17 at 02:55
  • Yes. You can use UDAF, but I'd rather stick to native aggregate functions first and only use UDAFs as the last resort. – Jacek Laskowski May 30 '17 at 07:14
  • I want to find number of each categories in each feature. For example: for feature one output is an array like 2,1. But in here for simplicity I write the categories like red, blue. but in my problem I will change each category to bit representation. for example: in first feature I have 2 categories(Blue and orange). I will use 2 bit to represent it. so red will be 10 and orange will be 01. then I will sum column-wise and output will be 11 which means 1 for blue 1 for orange. accordingly, I cant use normal aggregation like count. I want to use UDF. would you please help me? – Nooshin salek faramarzi May 30 '17 at 18:08