
This question is the same as the one posted here. It has an accepted answer for Scala, but I need to implement the same in Java.

How to select a subset of fields from an array column in Spark?

import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.udf
import spark.implicits._  // needed for the $"colName" syntax

case class Record(id: String, size: Int)

val dropUseless = udf((xs: Seq[Row]) =>  xs.map{
  case Row(id: String, size: Int, _) => Record(id, size)
})

df.select(dropUseless($"subClasss"))

I have tried to implement the above in Java but couldn't get it working. I'd appreciate any help. Thanks

this.spark.udf().register("dropUseless",
            (UDF1<Seq<Row>, Seq<Row>>) rows -> {
                Seq<Row> seq = JavaConversions
                    .asScalaIterator(
                        JavaConversions.seqAsJavaList(rows)
                            .stream()
                            .map((Row t) -> RowFactory.create(new Object[] {t.getAs("id"), t.getAs("size")})
                            ).iterator())
                    .toSeq();
                return seq;
            }, DataTypes.createStructType(Arrays.asList(
                DataTypes.createStructField("id", DataTypes.StringType, false),
                DataTypes.createStructField("size", DataTypes.IntegerType, true))
                )
            );
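
For reference, a minimal sketch of how the registration might look. It is untested and assumes a Spark 2.x Java API, with spark being the SparkSession and df the Dataset<Row> holding the subClasss array column from the question; the likely fix is declaring the UDF's return type as an ArrayType of the struct rather than a bare StructType, since the UDF returns a sequence of rows, not a single struct. The variable names recordType and trimmed are just for illustration.

import java.util.Arrays;
import java.util.stream.Collectors;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.api.java.UDF1;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

import scala.collection.JavaConversions;
import scala.collection.Seq;

import static org.apache.spark.sql.functions.callUDF;
import static org.apache.spark.sql.functions.col;

// Element type of the returned array: struct<id: string, size: int>
StructType recordType = DataTypes.createStructType(Arrays.asList(
    DataTypes.createStructField("id", DataTypes.StringType, false),
    DataTypes.createStructField("size", DataTypes.IntegerType, true)));

// Register with an array-of-struct return type, because the UDF maps an
// array of structs to another array of (narrower) structs.
// (No null handling here; add a check if subClasss can be null.)
spark.udf().register("dropUseless",
    (UDF1<Seq<Row>, Seq<Row>>) rows ->
        JavaConversions.asScalaBuffer(
            JavaConversions.seqAsJavaList(rows).stream()
                .map(r -> RowFactory.create(r.getAs("id"), r.getAs("size")))
                .collect(Collectors.toList()))
            .toSeq(),
    DataTypes.createArrayType(recordType));

// Apply the UDF to the array column from the question.
Dataset<Row> trimmed = df.select(callUDF("dropUseless", col("subClasss")));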

1 Answer


If we suppose you have a DataFrame (df), you can use native SQL to extract a new DataFrame (ndf) that contains the results you want.

Try this:

df.registerTempTable("df");

DataFrame ndf = sqlContext.sql("SELECT ..... FROM df WHERE ...");
  • Thanks. This normally works, but it doesn't retain the original schema in the case of a nested array of struct fields; that is the exact requirement in the original question here: https://stackoverflow.com/questions/36476358/how-to-select-a-subset-of-fields-from-an-array-column-in-spark – gbgunz Nov 24 '18 at 10:02
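
If upgrading is an option, Spark 2.4 added higher-order functions that keep the array-of-struct shape without a UDF. A hypothetical, untested sketch, reusing the df/ndf names from the answer and the subClasss column and id/size field names from the question:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

import static org.apache.spark.sql.functions.expr;

// transform() rewrites each element of the array; named_struct() keeps only
// the id and size fields, so the column stays an array of structs.
Dataset<Row> ndf = df.withColumn("subClasss",
    expr("transform(subClasss, x -> named_struct('id', x.id, 'size', x.size))"));

Unlike selecting individual fields with a plain SELECT, this preserves the nested array-of-struct shape, which seems to be the requirement in the linked question.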