I have a Spark dataframe like this:

+-----------------+---------------+----------+-----------+
|     column1     |    column2    | column3  |  column4  |
+-----------------+---------------+----------+-----------+
| a               | bbbbb         | cc       | >dddddddd |
| >aaaaaaaaaaaaaa | bb            | c        | dddd      |
| aa              | >bbbbbbbbbbbb | >ccccccc | ddddd     |
| aaaaa           | bbbb          | ccc      | d         |
+-----------------+---------------+----------+-----------+

I would like to find the length of the longest element in each column, to obtain something like this:

+---------+-----------+
| column  | maxLength |
+---------+-----------+
| column1 |        14 |
| column2 |        12 |
| column3 |         7 |
| column4 |         8 |
+---------+-----------+

I know how to do it column by column (see the sketch below), but I don't know how to tell Spark to do it for all columns at once.

I am using Spark with Scala.
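
For a single column I do something like this (a minimal sketch of what I mean by "column by column", using column1 from the table above):

import org.apache.spark.sql.functions.{col, length, max}

// longest string in one column, e.g. column1
df.agg(max(length(col("column1")))).show()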

    Please check: https://stackoverflow.com/questions/54263293/in-spark-iterate-through-each-column-and-find-the-max-length – dassum Aug 12 '19 at 16:13

1 Answer


You can use the agg function with the max and length functions to achieve it:

import org.apache.spark.sql.functions.{col, length, max}
import spark.implicits._ // for toDF (assumes a SparkSession named spark)
// runs one aggregation job per column and collects each max to the driver
val x = df.columns.map(colName =>
  (colName, df.agg(max(length(col(colName)))).head().getAs[Integer](0))
).toSeq.toDF("column", "maxLength")

Output:

+-------+---------+
|column |maxLength|
+-------+---------+
|column1|14       |
|column2|13       |
|column3|8        |
|column4|9        |
+-------+---------+

Another way, which computes all the maxima in a single job, is:

// one pass over the data: all max(length) aggregations run in the same job
df.select(df.columns.map(c => max(length(col(c))).as(s"max_${c}")): _*)

Output:

+-----------+-----------+-----------+-----------+
|max_column1|max_column2|max_column3|max_column4|
+-----------+-----------+-----------+-----------+
|14         |13         |8          |9          |
+-----------+-----------+-----------+-----------+
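
If you also want the (column, maxLength) layout from this second version, one option (just a sketch, not part of the code above; maxRow and maxLengths are names I made up) is to collect the single-row result and reshape it on the driver:

import org.apache.spark.sql.functions.{col, length, max}
import spark.implicits._ // for toDF (assumes a SparkSession named spark)

// one pass over the data, then reshape the one-row result on the driver
val maxRow = df
  .select(df.columns.map(c => max(length(col(c))).as(c)): _*)
  .head()

val maxLengths = df.columns
  .map(c => (c, maxRow.getAs[Int](c)))
  .toSeq
  .toDF("column", "maxLength")

maxLengths.show(false)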