1

I have a dataframe to aggregate one column based on the rest of the other columns. I do not want to give all those rest of the columns in groupBy with comma separated as I have about 30 columns. Could somebody tell me how can I do it in a way that looks more readable.

right now, am doing - df.groupBy("c1","c2","c3","c4","c5","c6","c7","c8","c9","c10",....).agg(c11)

I want to know if there is any better way..

Thanks, John

John
  • 187
  • 1
  • 3
  • 12

2 Answers2

1

Specifying the columns is the clean way to do it but I believe you have quite a few options.

One of them is to go to Spark SQL and compose the query programmatically composing the string.

Another option could be to use the varargs : _* on a list of columns names, like this:

val cols = ...
df.groupBy( cols : _*).agg(...)
Chobeat
  • 3,445
  • 6
  • 41
  • 59
  • Thanks for the reply. Could you please tell me how should my cols look like ?. I used an array of strings and this doesn't work. – John Jan 20 '17 at 23:08
  • This answers explains it correctly: http://stackoverflow.com/questions/37524510/spark-dataframe-groupby-with-sequence-as-keys-arguments – Chobeat Jan 21 '17 at 11:46
  • That was exactly what I was looking for. Thnx chobeat..! – John Jan 22 '17 at 05:00
0

Use below steps:

get the columns as list

remove the columns needs to be aggregated from the columns list.

apply groupBy & agg.

**Ex**:
val seq = Seq((101, "abc", 24), (102, "cde", 24), (103, "efg", 22), (104, "ghi", 21), (105, "ijk", 20), (106, "klm", 19), (107, "mno", 18), (108, "pqr", 18), (109, "rst", 26), (110, "tuv", 27), (111, "pqr", 18), (112, "rst", 28), (113, "tuv", 29))
val df = sc.parallelize(seq).toDF("id", "name", "age")

val colsList = df.columns.toList
(colsList: List[String] = List(id, name, age))

val groupByColumns = colsList.slice(0, colsList.size-1)
(groupByColumns: List[String] = List(id, name))
val aggColumn = colsList.last
(aggColumn: String = age)

df.groupBy(groupByColumns.head, groupByColumns.tail:_*).agg(avg(aggColumn)).show
+---+----+--------+
| id|name|avg(age)|
+---+----+--------+
|105| ijk|    20.0|
|108| pqr|    18.0|
|112| rst|    28.0|
|104| ghi|    21.0|
|111| pqr|    18.0|
|113| tuv|    29.0|
|106| klm|    19.0|
|102| cde|    24.0|
|107| mno|    18.0|
|101| abc|    24.0|
|103| efg|    22.0|
|110| tuv|    27.0|
|109| rst|    26.0|
+---+----+--------+
Robert
  • 5,278
  • 43
  • 65
  • 115
mputha
  • 395
  • 5
  • 7