
I'm trying to run PCA on a matrix that contains n columns of unlabeled doubles. My code is:

    import org.apache.spark.ml.feature.PCA;
    import org.apache.spark.ml.feature.PCAModel;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    SparkSession spark = SparkSession
        .builder()
        .appName("JavaPCAExample")
        .getOrCreate();

    // the CSV has no header row, so Spark auto-names the columns _c0, _c1, ...
    Dataset<Row> data = spark.read().format("csv")
        .option("sep", ",")
        .option("inferSchema", "true")
        .option("header", "false")
        .load("testInput/matrix.csv");

    PCAModel pca = new PCA()
//      .setInputCol("features")    // the raw DataFrame has no column named "features"
//      .setOutputCol("pcaFeatures")
        .setK(3)
        .fit(data);

    Dataset<Row> result = pca.transform(data).select("pcaFeatures");
    result.show(true);

    spark.stop();

Running this throws:

    java.lang.IllegalArgumentException: Field "features" does not exist.

I've found these posts:

- How to merge multiple feature vectors in DataFrame?
- How to work with Java Apache Spark MLlib when DataFrame has columns?

They led me to the VectorAssembler docs here: https://spark.apache.org/docs/latest/ml-features.html#vectorassembler

In each of those examples, the input column names are listed out manually and assembled into a features vector. I haven't been able to figure out how to use VectorAssembler to turn all n of my unlabeled columns into features. Any insight would be appreciated. Thanks.
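For reference, the doc-style usage names every input column explicitly, roughly like this (the column names here are hypothetical placeholders, not from my data):

    import org.apache.spark.ml.feature.VectorAssembler;

    // doc-style sketch: input columns enumerated by hand (names are hypothetical)
    VectorAssembler assembler = new VectorAssembler()
        .setInputCols(new String[]{"colA", "colB", "colC"})
        .setOutputCol("features");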

1 Answer


Found the .columns() method: Dataset.columns() returns every column name as a String[], so the auto-generated names (_c0, _c1, ...) can be passed straight to VectorAssembler.setInputCols() without listing them by hand:

    import org.apache.spark.ml.feature.PCA;
    import org.apache.spark.ml.feature.PCAModel;
    import org.apache.spark.ml.feature.VectorAssembler;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    SparkSession spark = SparkSession
        .builder()
        .appName("JavaPCAExample")
        .getOrCreate();

    Dataset<Row> data = spark.read().format("csv")
        .option("sep", ",")
        .option("inferSchema", "true")
        .option("header", "false")
        .load("testInput/matrix.csv");

    // data.columns() returns all column names (_c0, _c1, ...) as a String[],
    // so every column gets assembled into one vector column automatically
    VectorAssembler assembler = new VectorAssembler()
        .setInputCols(data.columns())
        .setOutputCol("features");

    Dataset<Row> output = assembler.transform(data);

    PCAModel pca = new PCA()
        .setInputCol("features")
        .setOutputCol("pcaFeatures")
        .setK(5)    // number of principal components to keep
        .fit(output);
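
To round it off, the fitted model can then be applied the same way as in the question (a minimal sketch, assuming the session and columns above):

    // project the assembled features onto the top principal components
    Dataset<Row> result = pca.transform(output).select("pcaFeatures");
    result.show(false); // false = print the vectors without truncation

    spark.stop();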