
I have a dataframe which looks like this:

    [ID_number,cust_number,feature1,feature2,feature3,....]

Now I want to write a query that groups by ID_number and applies a User Defined Function to each subset

    [cust_number,feature1,feature2,feature3,......]

grouped by each ID_number. I need to apply Machine Learning algorithms to the features of each group and store the resulting weights somehow.

How do I do this using Apache Spark DataFrames (in Scala)?
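
To make the intent concrete, this is roughly what I am after (trainModel below is just a stand-in for the actual ML step, and I am assuming ID_number is a String):

    import org.apache.spark.sql.Row

    // rough sketch of the intent: one set of weights per ID_number
    def trainModel(rows: Iterable[Row]): Array[Double] = ???  // the ML part

    val weightsPerId = df.rdd
      .groupBy(_.getAs[String]("ID_number"))  // group rows by ID_number
      .mapValues(trainModel)                  // train one model per group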

marios
    Possible duplicate of [How can I define and use a User-Defined Aggregate Function in Spark SQL?](http://stackoverflow.com/questions/32100973/how-can-i-define-and-use-a-user-defined-aggregate-function-in-spark-sql) – zero323 Jun 19 '16 at 18:26

1 Answer


You can do something like this (PySpark):

    from pyspark.sql.types import StructType, StructField, StringType

    schema_string = "cust_number,feature1,feature2,feature3"

    # one StringType field per column name
    fields = [StructField(field_name, StringType(), True)
              for field_name in schema_string.split(",")]

    schema = StructType(fields)
    df = sql_context.createDataFrame(group_by_result_rdd, schema)

Note: here I am assuming all your features are of type String. See the API docs for other data types.
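
Since the question asks for Scala, a minimal Dataset-based sketch of the per-group training itself (assuming Spark 2.x, Double-typed features, and a placeholder train function in place of the real ML step) could look like:

    import org.apache.spark.sql.SparkSession

    case class Record(ID_number: String, cust_number: String,
                      feature1: Double, feature2: Double, feature3: Double)

    val spark = SparkSession.builder().appName("per-group-training").getOrCreate()
    import spark.implicits._

    // placeholder: fit a model on one group's rows and return its weights
    def train(rows: Seq[Record]): Array[Double] = Array.empty[Double]

    val weights = df.as[Record]                         // df is the question's dataframe
      .groupByKey(_.ID_number)                          // one group per ID_number
      .mapGroups((id, rows) => (id, train(rows.toSeq))) // train on each group
      .toDF("ID_number", "weights")

Note that mapGroups materializes each group on one executor, so each ID_number's subset has to fit in memory; for an aggregation-style alternative, see the UDAF question linked in the comments.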

Neel Tiwari