
I have a dataframe which looks like this:

    [ID_number,cust_number,feature1,feature2,feature3,....]

Now I want to write a query that groups by ID_number and applies a User Defined Function to each subset

    [cust_number,feature1,feature2,feature3,......]

grouped by each ID_number. I need to apply Machine Learning algorithms to the features of each group and store the resulting weights somehow.

How do I do this using Apache Spark DataFrames (in Scala)?
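
To make the intent concrete, this is roughly what I am after (trainModel below is just a stand-in for the actual ML step, and I am assuming ID_number is a String):

    import org.apache.spark.sql.Row

    // rough sketch of the intent: one set of weights per ID_number
    def trainModel(rows: Iterable[Row]): Array[Double] = ???  // the ML part

    val weightsPerId = df.rdd
      .groupBy(_.getAs[String]("ID_number"))  // group rows by ID_number
      .mapValues(trainModel)                  // train one model per group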

marios
    Possible duplicate of [How can I define and use a User-Defined Aggregate Function in Spark SQL?](http://stackoverflow.com/questions/32100973/how-can-i-define-and-use-a-user-defined-aggregate-function-in-spark-sql) – zero323 Jun 19 '16 at 18:26

1 Answer


You can do something like this (PySpark):

    from pyspark.sql.types import StructType, StructField, StringType

    schema_string = "cust_number,feature1,feature2,feature3"

    # one StringType field per column name
    fields = [StructField(field_name, StringType(), True)
              for field_name in schema_string.split(",")]

    schema = StructType(fields)
    df = sql_context.createDataFrame(group_by_result_rdd, schema)

Note: here I am assuming all your features are of type String. See the API docs for other data types.
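
Since the question asks for Scala, a minimal Dataset-based sketch of the per-group training itself (assuming Spark 2.x, Double-typed features, and a placeholder train function in place of the real ML step) could look like:

    import org.apache.spark.sql.SparkSession

    case class Record(ID_number: String, cust_number: String,
                      feature1: Double, feature2: Double, feature3: Double)

    val spark = SparkSession.builder().appName("per-group-training").getOrCreate()
    import spark.implicits._

    // placeholder: fit a model on one group's rows and return its weights
    def train(rows: Seq[Record]): Array[Double] = Array.empty[Double]

    val weights = df.as[Record]                         // df is the question's dataframe
      .groupByKey(_.ID_number)                          // one group per ID_number
      .mapGroups((id, rows) => (id, train(rows.toSeq))) // train on each group
      .toDF("ID_number", "weights")

Note that mapGroups materializes each group on one executor, so each ID_number's subset has to fit in memory; for an aggregation-style alternative, see the UDAF question linked in the comments.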

Neel Tiwari