
Assuming I have the following DataFrame:

+---+--------+---+----+----+
|grp|null_col|ord|col1|col2|
+---+--------+---+----+----+
|  1|    null|  3|null|  11|
|  2|    null|  2| xxx|  22|
|  1|    null|  1| yyy|null|
|  2|    null|  7|null|  33|
|  1|    null| 12|null|null|
|  2|    null| 19|null|  77|
|  1|    null| 10| s13|null|
|  2|    null| 11| a23|null|
+---+--------+---+----+----+

Here is the same sample DF, with comments, sorted by grp and ord:

scala> df.orderBy("grp", "ord").show
+---+--------+---+----+----+
|grp|null_col|ord|col1|col2|
+---+--------+---+----+----+
|  1|    null|  1| yyy|null|
|  1|    null|  3|null|  11|   # grp:1 - last value for `col2` (11)
|  1|    null| 10| s13|null|   # grp:1 - last value for `col1` (s13)
|  1|    null| 12|null|null|   # grp:1 - last values for `null_col`, `ord`
|  2|    null|  2| xxx|  22|   
|  2|    null|  7|null|  33|   
|  2|    null| 11| a23|null|   # grp:2 - last value for `col1` (a23)
|  2|    null| 19|null|  77|   # grp:2 - last values for `null_col`, `ord`, `col2`
+---+--------+---+----+----+

I would like to "compress" it, i.e. to group it by the "grp" column and, within each group, sort the rows by "ord" and take the last non-null value in each column (if there is one).

+---+--------+---+----+----+
|grp|null_col|ord|col1|col2|
+---+--------+---+----+----+
|  1|    null| 12| s13|  11|
|  2|    null| 19| a23|  77|
+---+--------+---+----+----+

I've seen some similar questions, but my real DataFrame has over 250 columns, so I need a solution where I don't have to list all the columns explicitly.

I can't wrap my head around it...


MCVE: how to create a sample DataFrame (an in-code alternative is sketched right after these steps):

  1. create a local file "/tmp/data.txt" and copy and paste into it the contents of the DataFrame (as posted above)
  2. define the function readSparkOutput() (a working version appears in eliasah's answer below):
  3. parse "/tmp/data.txt" into a DataFrame:

    val df = readSparkOutput("file:///tmp/data.txt")
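
Alternatively (a minimal sketch, mirroring the Seq-based construction used in the answers below, and assuming spark.implicits._ is available as in spark-shell), the same sample can be built directly in code without parsing the text file; note that null_col ends up as the last column here rather than the second:

    import org.apache.spark.sql.functions.lit
    import spark.implicits._

    val df = Seq(
      (1, 3, None, Some(11)),
      (2, 2, Some("xxx"), Some(22)),
      (1, 1, Some("yyy"), None),
      (2, 7, None, Some(33)),
      (1, 12, None, None),
      (2, 19, None, Some(77)),
      (1, 10, Some("s13"), None),
      (2, 11, Some("a23"), None)
    ).toDF("grp", "ord", "col1", "col2")
      .withColumn("null_col", lit(null).cast("string"))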
    

UPDATE: I think it should be similar to the following SQL:

SELECT
  grp, ord, null_col, col1, col2
FROM (
    SELECT
      grp,
      ord,
      FIRST(null_col) OVER (PARTITION BY grp ORDER BY ord DESC) as null_col,
      FIRST(col1) OVER (PARTITION BY grp ORDER BY ord DESC) as col1,
      FIRST(col2) OVER (PARTITION BY grp ORDER BY ord DESC) as col2,
      ROW_NUMBER() OVER (PARTITION BY grp ORDER BY ord DESC) as rn
    FROM table_name) as v
WHERE v.rn = 1;

How can we dynamically generate such a Spark query?
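
A hedged sketch of one way to do that is to build the query text from df.columns and run it with spark.sql (illustrative only, not tested against the real 250-column table). Note the explicit ROWS frame, which the SQL above leaves at its default: the default frame stops at the current row, so FIRST on the rn = 1 row would never look past that row.

df.createOrReplaceTempView("table_name")

// one FIRST(..., true) window expression per column except the grouping column
val windowed = df.columns.filterNot(_ == "grp").map { c =>
  s"FIRST(`$c`, true) OVER (PARTITION BY grp ORDER BY ord DESC " +
    s"ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS `$c`"
}.mkString(",\n      ")

val query =
  s"""SELECT ${df.columns.map(c => s"`$c`").mkString(", ")}
     |FROM (
     |    SELECT grp,
     |      $windowed,
     |      ROW_NUMBER() OVER (PARTITION BY grp ORDER BY ord DESC) AS rn
     |    FROM table_name) v
     |WHERE v.rn = 1""".stripMargin

spark.sql(query).show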

I tried the following simplified approach:

import org.apache.spark.sql.expressions.Window

val win = Window
  .partitionBy("grp")
  .orderBy($"ord".desc)

val cols = df.columns.map(c => first(c, ignoreNulls=true).over(win).as(c))

which produces:

scala> cols
res23: Array[org.apache.spark.sql.Column] = Array(first(grp, true) OVER (PARTITION BY grp ORDER BY ord DESC NULLS LAST UnspecifiedFrame) AS `grp`, first(null_col, true) OVER (PARTITION BY grp ORDER BY ord DESC NULLS LAST UnspecifiedFrame) AS `null_col`, first(ord, true) OVER (PARTITION BY grp ORDER BY ord DESC NULLS LAST UnspecifiedFrame) AS `ord`, first(col1, true) OVER (PARTITION BY grp ORDER BY ord DESC NULLS LAST UnspecifiedFrame) AS `col1`, first(col2, true) OVER (PARTITION BY grp ORDER BY ord DESC NULLS LAST UnspecifiedFrame) AS `col2`)

but I couldn't pass it to df.select:

scala> df.select(cols.head, cols.tail: _*).show
<console>:34: error: no `: _*' annotation allowed here
(such annotations are only allowed in arguments to *-parameters)
       df.select(cols.head, cols.tail: _*).show

Another attempt:

scala> df.select(cols.map(col): _*).show
<console>:34: error: type mismatch;
 found   : String => org.apache.spark.sql.Column
 required: org.apache.spark.sql.Column => ?
       df.select(cols.map(col): _*).show
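
For reference, and as eliasah points out in the comments below, cols is already an Array[Column], so no head/tail splitting (and no extra col() mapping) is needed:

df.select(cols: _*).show
// this compiles, but (as also noted in the comments) it still doesn't return
// the desired rows, because first() runs over the default window frame here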
MaxU - stand with Ukraine
  • Well I guess you could write a function that accepts a DF and a group-by column name; you could then dynamically create a select statement, by getting the remaining columns from the DF schema, that selects the max of each column. – Terry Dactyl Nov 05 '18 at 13:31
  • @TerryDactyl, thanks for the advice! Unfortunately `max()` wouldn't give me what I need. I need the last (in terms of the "ord" column) non-null value. And I still don't see how I can do it dynamically. I would appreciate it if you could put it in an answer. – MaxU - stand with Ukraine Nov 05 '18 at 13:38
  • @MaxU in pyspark, I *think* `df.groupBy("grp").agg(*[last(c, True).alias(c) for c in df.columns if c!= "grp"]).show()` gives you what you want. Don't know how to translate that into scala. The function `last` takes a second boolean argument `ignorenulls` which defaults to `False`. – pault Nov 05 '18 at 17:36
  • @pault, thank you for the hint! Unfortunately it ignores the order within each group, so it won't produce the desired output. – MaxU - stand with Ukraine Nov 05 '18 at 17:50
  • The correct syntax to use `cols` here is the following: `df.select(cols: _*)` – eliasah Nov 13 '18 at 10:10
  • Unfortunately, this will not return what you want... – eliasah Nov 13 '18 at 10:11
  • @eliasah, thank you for your comment. Do you have an idea how to get the desired data set dynamically? – MaxU - stand with Ukraine Nov 13 '18 at 10:15
  • I'm looking into it. I think that LeoC's approach is good. It's all about column selection though. – eliasah Nov 13 '18 at 10:16
  • @MaxU I've added an answer about how I would do this. Hope it helps! – eliasah Nov 13 '18 at 10:42
  • If you have a sec, please do consider weighing in on my post here: https://stackoverflow.com/questions/53927460/how-do-i-slice-or-filter-mutliindex-dataframe-levels thanks :) – cs95 Dec 26 '18 at 15:28
  • @coldspeed, sure, it’s a great job! :) – MaxU - stand with Ukraine Dec 26 '18 at 15:42
  • Thanks Max, much appreciated :) – cs95 Dec 26 '18 at 15:50

5 Answers


Consider the following approach: apply the window function last(c, ignoreNulls=true), ordered by "ord" within each "grp", to each of the selected columns; then groupBy("grp") and fetch the first value per group via agg(colFcnMap):

import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window

val df0 = Seq(
  (1, 3, None, Some(11)),
  (2, 2, Some("aaa"), Some(22)),
  (1, 1, Some("s12"), None),
  (2, 7, None, Some(33)),
  (1, 12, None, None),
  (2, 19, None, Some(77)),
  (1, 10, Some("s13"), None),
  (2, 11, Some("a23"), None)
).toDF("grp", "ord", "col1", "col2")

val df = df0.withColumn("null_col", lit(null))

df.orderBy("grp", "ord").show
// +---+---+----+----+--------+
// |grp|ord|col1|col2|null_col|
// +---+---+----+----+--------+
// |  1|  1| s12|null|    null|
// |  1|  3|null|  11|    null|
// |  1| 10| s13|null|    null|
// |  1| 12|null|null|    null|
// |  2|  2| aaa|  22|    null|
// |  2|  7|null|  33|    null|
// |  2| 11| a23|null|    null|
// |  2| 19|null|  77|    null|
// +---+---+----+----+--------+

// The frame spans from the current row to the end of the partition, so
// last(c, ignoreNulls = true) picks up the last non-null value (by "ord") in each column.
val win = Window.partitionBy("grp").orderBy("ord").
  rowsBetween(0, Window.unboundedFollowing)

val nonAggCols = Array("grp")
val cols = df.columns.diff(nonAggCols)  // Columns to be aggregated

val colFcnMap = cols.zip(Array.fill(cols.size)("first")).toMap
// colFcnMap: scala.collection.immutable.Map[String,String] =
//   Map(ord -> first, col1 -> first, col2 -> first, null_col -> first)

cols.foldLeft(df)((acc, c) =>
    acc.withColumn(c, last(c, ignoreNulls=true).over(win))
  ).
  groupBy("grp").agg(colFcnMap).
  select(col("grp") :: colFcnMap.toList.map{case (c, f) => col(s"$f($c)").as(c)}: _*).
  show
// +---+---+----+----+--------+
// |grp|ord|col1|col2|null_col|
// +---+---+----+----+--------+
// |  1| 12| s13|  11|    null|
// |  2| 19| a23|  77|    null|
// +---+---+----+----+--------+

Note that the final select strips the function name (in this case first()) from the aggregated column names.
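
To illustrate why that renaming is needed (a hedged check using the definitions above; the intermediate value is named aggregated here only for illustration, and the exact ordering of the generated columns follows the Map and may vary):

aggregated.columns.foreach(println)
// grp
// first(ord)
// first(col1)
// ... (one "first(<col>)" per aggregated column)

val aggregated = cols.foldLeft(df)((acc, c) =>
    acc.withColumn(c, last(c, ignoreNulls = true).over(win))
  ).groupBy("grp").agg(colFcnMap)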

Leo C
  • Thank you for your help! In my real DF I have more than 250 columns with different names (they don’t start with “col”) and some of the columns will contain only NULLs, like in my sample data set. Do you know how I can adapt your method in this case? PS: the “ord” column should also be present in the output data set. – MaxU - stand with Ukraine Nov 06 '18 at 06:21
  • @MaxU, depending on the actual column selection criteria, there are various ways to come up with the selected columns. For example, if most of the columns are to be aggregated, one could assemble the list of columns that aren't to be aggregated and apply `diff` to exclude them, as shown in the expanded answer. – Leo C Nov 06 '18 at 13:45
  • We are getting closer, thank you! :) Somehow I'm getting a different resulting data set (`'col1': [null, null], 'col2': [null, 77]`, other columns are OK). I couldn't reproduce your results with either my file or the sample data set from your answer... – MaxU - stand with Ukraine Nov 06 '18 at 13:48
  • @MaxU, not sure why you're getting a different result. I've just executed the exact code again with my sample dataset on a freshly restarted Spark shell (v2.2.1) and the result remains the same as in my answer. – Leo C Nov 06 '18 at 14:09

I have worked something out. Here is the code and the output:

import org.apache.spark.sql.functions._
import spark.implicits._

val df0 = Seq(
  (1, 3, None, Some(11)),
  (2, 2, Some("aaa"), Some(22)),
  (1, 1, Some("s12"), None),
  (2, 7, None, Some(33)),
  (1, 12, None, None),
  (2, 19, None, Some(77)),
  (1, 10, Some("s13"), None),
  (2, 11, Some("a23"), None)
).toDF("grp", "ord", "col1", "col2")

df0.show()

//+---+---+----+----+
//|grp|ord|col1|col2|
//+---+---+----+----+
//|  1|  3|null|  11|
//|  2|  2| aaa|  22|
//|  1|  1| s12|null|
//|  2|  7|null|  33|
//|  1| 12|null|null|
//|  2| 19|null|  77|
//|  1| 10| s13|null|
//|  2| 11| a23|null|
//+---+---+----+----+

Ordering the data by the first two columns:

val df1 = df0.select("grp", "ord", "col1", "col2").orderBy("grp", "ord")

df1.show()

//+---+---+----+----+
//|grp|ord|col1|col2|
//+---+---+----+----+
//|  1|  1| s12|null|
//|  1|  3|null|  11|
//|  1| 10| s13|null|
//|  1| 12|null|null|
//|  2|  2| aaa|  22|
//|  2|  7|null|  33|
//|  2| 11| a23|null|
//|  2| 19|null|  77|
//+---+---+----+----+

val df2 = df1.groupBy("grp").agg(
  max("ord").alias("ord"),
  collect_set("col1").alias("col1"),
  collect_set("col2").alias("col2"))

val df3 = df2
  .withColumn("new_col1", $"col1".apply(size($"col1").minus(1)))
  .withColumn("new_col2", $"col2".apply(size($"col2").minus(1)))

df3.show()

//+---+---+----------+------------+--------+--------+
//|grp|ord|      col1|        col2|new_col1|new_col2|
//+---+---+----------+------------+--------+--------+
//|  1| 12|[s12, s13]|        [11]|     s13|      11|
//|  2| 19|[aaa, a23]|[33, 22, 77]|     a23|      77|
//+---+---+----------+------------+--------+--------+

You can drop the columns you don't need by using .drop("column_name")
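
For instance (a small follow-up sketch using the column names from the example above), dropping the intermediate array columns and restoring the original names:

val df4 = df3
  .drop("col1", "col2")
  .withColumnRenamed("new_col1", "col1")
  .withColumnRenamed("new_col2", "col2")

df4.show()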

Shahab Niaz

So here we are grouping by a and aggregating each of the other columns in the group (the code below uses sum as a stand-in, aliased as max_*; swap in whichever aggregate you need):

scala> val df = List((1,2,11), (1,1,1), (2,1,4), (2,3,5)).toDF("a", "b", "c")
df: org.apache.spark.sql.DataFrame = [a: int, b: int ... 1 more field]

scala> val aggCols = df.schema.map(_.name).filter(_ != "a").map(colName => sum(col(colName)).alias(s"max_$colName"))
aggCols: Seq[org.apache.spark.sql.Column] = List(sum(b) AS `max_b`, sum(c) AS `max_c`)

scala> df.groupBy(col("a")).agg(aggCols.head, aggCols.tail: _*)
res0: org.apache.spark.sql.DataFrame = [a: int, max_b: bigint ... 1 more field]
Terry Dactyl
  • Thank you! I tried your solution (I've changed the aggregation function: `sum()` --> `last()`), but it doesn't work as expected with NULLs. Could you please give it a try with my sample data set? – MaxU - stand with Ukraine Nov 05 '18 at 13:51

Here is your answer (and hopefully my bounty!!!)

scala> val df = spark.sparkContext.parallelize(List(
     | (1,null.asInstanceOf[String],3,null.asInstanceOf[String],new Integer(11)),
     | (2,null.asInstanceOf[String],2,new String("xxx"),new Integer(22)),
     | (1,null.asInstanceOf[String],1,new String("yyy"),null.asInstanceOf[Integer]),
     | (2,null.asInstanceOf[String],7,null.asInstanceOf[String],new Integer(33)),
     | (1,null.asInstanceOf[String],12,null.asInstanceOf[String],null.asInstanceOf[Integer]),
     | (2,null.asInstanceOf[String],19,null.asInstanceOf[String],new Integer(77)),
     | (1,null.asInstanceOf[String],10,new String("s13"),null.asInstanceOf[Integer]),
     | (2,null.asInstanceOf[String],11,new String("a23"),null.asInstanceOf[Integer]))).toDF("grp","null_col","ord","col1","col2")
df: org.apache.spark.sql.DataFrame = [grp: int, null_col: string ... 3 more fields]

scala> df.show
+---+--------+---+----+----+
|grp|null_col|ord|col1|col2|
+---+--------+---+----+----+
|  1|    null|  3|null|  11|
|  2|    null|  2| xxx|  22|
|  1|    null|  1| yyy|null|
|  2|    null|  7|null|  33|
|  1|    null| 12|null|null|
|  2|    null| 19|null|  77|
|  1|    null| 10| s13|null|
|  2|    null| 11| a23|null|
+---+--------+---+----+----+

//Create window specification

scala> import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.expressions.Window

scala> val win = Window.partitionBy("grp").orderBy($"ord".desc)
win: org.apache.spark.sql.expressions.WindowSpec = org.apache.spark.sql.expressions.WindowSpec@71878833

//Use foldLeft with first over window specification over all columns and take distinct

scala> val result = df.columns.foldLeft(df)((df, colName) => df.withColumn(colName, first(colName, ignoreNulls=true).over(win).as(colName))).distinct
result: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [grp: int, null_col: string ... 3 more fields]

scala> result.show
+---+--------+---+----+----+
|grp|null_col|ord|col1|col2|
+---+--------+---+----+----+
|  1|    null| 12| s13|  11|
|  2|    null| 19| a23|  77|
+---+--------+---+----+----+

Hope this helps.

m-bhole

I'd go with the same approach as @LeoC, but I believe there is no need to manipulate column names as strings, and I'd go with a more Spark SQL-like answer.

import org.apache.spark.sql.Column
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, first, last, when}

val win = Window.partitionBy("grp").orderBy(col("ord")).rowsBetween(0, Window.unboundedFollowing)

// In case there is more than one group column
val nonAggCols = Seq("grp")

// Select columns to aggregate on
val cols: Seq[String] = df.columns.diff(nonAggCols).toSeq

// Map over selection and apply fct
val aggregations: Seq[Column] = cols.map(c => first(col(c), ignoreNulls = true).as(c))

// I'd rather cache the following step as it might get expensive
val step1 = cols.foldLeft(df)((acc, c) => acc.withColumn(c, last(col(c), ignoreNulls = true).over(win))).cache

// Finally we can aggregate our results as followed
val results = step1.groupBy(nonAggCols.head, nonAggCols.tail: _*).agg(aggregations.head, aggregations.tail: _*)

results.show
// +---+--------+---+----+----+
// |grp|null_col|ord|col1|col2|
// +---+--------+---+----+----+
// |  1|    null| 12| s13|  11|
// |  2|    null| 19| a23|  77|
// +---+--------+---+----+----+

I hope this helps.

EDIT: The reason you are not getting the same results is that the reader you are using isn't correct.

It interprets the literal null in the file as a string rather than a real null, e.g.:

scala> df.filter('col1.isNotNull).show
// +---+--------+---+----+----+
// |grp|null_col|ord|col1|col2|
// +---+--------+---+----+----+
// |  1|    null|  3|null|  11|
// |  2|    null|  2| xxx|  22|
// |  1|    null|  1| yyy|null|
// |  2|    null|  7|null|  33|
// |  1|    null| 12|null|null|
// |  2|    null| 19|null|  77|
// |  1|    null| 10| s13|null|
// |  2|    null| 11| a23|null|
// +---+--------+---+----+----+

Here is my version of readSparkOutput:

def readSparkOutput(filePath: String): org.apache.spark.sql.DataFrame = {
  // Read the show()-style table: '|'-delimited, with a header row,
  // treating the "+---+" border lines as comments
  val step1 = spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .option("delimiter", "|")
    .option("parserLib", "UNIVOCITY")
    .option("ignoreLeadingWhiteSpace", "true")
    .option("ignoreTrailingWhiteSpace", "true")
    .option("comment", "+")
    .csv(filePath)

  // Drop the empty "_c..." columns produced by the leading/trailing '|' delimiters
  val step2 = step1.select(step1.columns.filterNot(_.startsWith("_c")).map(step1(_)): _*)

  // Replace the literal string "null" with a real null in every column
  val columns = step2.columns
  columns.foldLeft(step2)((acc, c) =>
    acc.withColumn(c, when(col(c) =!= "null" and col(c).isNotNull, col(c))))
}
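
With this reader the literal nulls become real nulls, so re-running the check from above (a quick sanity check, assuming the same /tmp/data.txt file) should now return only the rows where col1 is actually set:

val df = readSparkOutput("file:///tmp/data.txt")
df.filter(col("col1").isNotNull).show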
eliasah