I have a dataframe with the following schema:

subjectID, feature001, feature002, feature003, ..., feature299

Let's say my dataframe looks like:

123,0.23,0.54,0.35,...,0.26   
234,0.17,0.49,0.47,...,0.69

Now, what I want is:

subjectID, featureID, featureValue

The above dataframe would look like:

123,001,0.23
123,002,0.54
123,003,0.35
......
123,299,0.26
234,001,0.17
234,002,0.49
234,003,0.47
......
234,299,0.69

I know how to achieve this if I only have a few columns:

newDF = df.select($"subjectID", expr("stack(3, 'feature001', 001, 'feature002', 002, 'feature003', 003) as (featureID, featureValue)"))

However, I am looking for a way to deal with 300 columns.

Syan

1 Answer


You can build an array of structs from the feature columns and then use explode to turn each struct into its own row:

import org.apache.spark.sql.functions.{explode, struct, lit, array, col}
import spark.implicits._   // for the $"..." column syntax (already in scope in spark-shell)

// build an array of struct expressions from the feature columns
val columnExprs = df.columns
  .filter(_.startsWith("feature"))
  .map(name => struct(lit(name.replace("feature", "")) as "id", col(name) as "value"))  // "feature001" -> id "001"

// unpivot the DataFrame
val newDF = df.select($"subjectID", explode(array(columnExprs:_*)) as "feature")
              .select($"subjectID", 
                      $"feature.id" as "featureID", 
                      $"feature.value" as "featureValue") 
rluta