What is the difference between collect_list() and array() in Spark using Scala?

I see both used all over the place, but the use cases are not clear enough for me to tell the two apart.

vfrank66

1 Answer

Even though both array and collect_list return an ArrayType column, the two methods are very different.

The array method combines a number of columns "column-wise" into one array per row, whereas collect_list aggregates the values of a single column "row-wise", typically by group (or Window partition), into an array, as shown below:

import org.apache.spark.sql.functions._
import spark.implicits._

val df = Seq(
  (1, "a", "b"),
  (1, "c", "d"),
  (2, "e", "f")
).toDF("c1", "c2", "c3")

df.
  withColumn("arr", array("c2", "c3")).
  show
// +---+---+---+------+
// | c1| c2| c3|   arr|
// +---+---+---+------+
// |  1|  a|  b|[a, b]|
// |  1|  c|  d|[c, d]|
// |  2|  e|  f|[e, f]|
// +---+---+---+------+

df.
  groupBy("c1").agg(collect_list("c2")).
  show
// +---+----------------+
// | c1|collect_list(c2)|
// +---+----------------+
// |  1|          [a, c]|
// |  2|             [e]|
// +---+----------------+
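
The Window-partition case mentioned above can be sketched in the same way. This is only a minimal illustration, assuming a local SparkSession; the object name WindowCollectList, the window value byC1, and the output column name c2_list are made up for the example. Unlike groupBy, the windowed form keeps every input row and attaches to each one the list of c2 values found in its c1 partition. The order of elements inside the collected array is not guaranteed.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.collect_list

// Hypothetical wrapper object so the sketch can run as a standalone program.
object WindowCollectList {
  def main(args: Array[String]): Unit = {
    // Assumption: a local session is acceptable for this illustration.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("collect-list-window")
      .getOrCreate()
    import spark.implicits._

    // Same sample data as in the answer above.
    val df = Seq(
      (1, "a", "b"),
      (1, "c", "d"),
      (2, "e", "f")
    ).toDF("c1", "c2", "c3")

    // With no ordering specified, the frame spans the whole partition.
    val byC1 = Window.partitionBy("c1")

    // Rows are not collapsed: each row gets the list for its own c1 value.
    df.withColumn("c2_list", collect_list("c2").over(byC1)).show()

    spark.stop()
  }
}
```

In spark-shell, where spark already exists, the last three statements inside main (building df, defining byC1, and the withColumn call) can be pasted on their own after importing Window.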
Shaido
Leo C