
I am new to Spark with Scala and I have the following situation. I have a table "TEST_TABLE" on the cluster (it can be a Hive table), and I convert it to a DataFrame as:

scala> val testDF = spark.sql("select * from TEST_TABLE limit 10")

Now the DF can be viewed as

scala> testDF.show()

COL1|COL2|COL3  
----------------
abc|abcd|abcdef 
a|BCBDFG|qddfde 
MN|1234B678|sd

I want an output like below

COLUMN_NAME|MAX_LENGTH
       COL1|3
       COL2|8
       COL3|6

Is this feasible in Spark Scala?

A8H1

3 Answers


Plain and simple:

import org.apache.spark.sql.functions._

val df = spark.table("TEST_TABLE")
// One aggregation job computes max(length(...)) for every column at once
df.select(df.columns.map(c => max(length(col(c)))): _*).show()
    can you please convert this to PySpark as well? Thanks – BadBoy777 Feb 15 '20 at 04:39
  • @BadBoy777 https://stackoverflow.com/questions/64656561/pyspark-max-string-length-for-each-column-in-the-dataframe/64675496#64675496 I've added pyspark code for a different question. – Ronak Jain Dec 26 '22 at 05:33
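The aggregation above yields a single row whose values are the per-column maxima; to reach the vertical COLUMN_NAME|MAX_LENGTH shape the question asks for, each column name just needs to be paired with its maximum. The same per-column max-length logic can be sketched in plain Scala (no Spark needed; sample data copied from the question):

```scala
// Plain-Scala sketch of what max(length(col(c))) computes per column in Spark.
val columns = Seq("COL1", "COL2", "COL3")
val rows = Seq(
  Seq("abc", "abcd", "abcdef"),
  Seq("a", "BCBDFG", "qddfde"),
  Seq("MN", "1234B678", "sd")
)
// For each column index, take the max string length across all rows
val maxLens = columns.indices.map(i => rows.map(r => r(i).length).max)
// Pair each column name with its maximum length
val report = columns.zip(maxLens)
// report contains (COL1,3), (COL2,8), (COL3,6)
```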

You can try the following:

import org.apache.spark.sql.functions.{length, max}
import spark.implicits._

val df = Seq(("abc","abcd","abcdef"),
          ("a","BCBDFG","qddfde"),
          ("MN","1234B678","sd"),
          (null,"","sd")).toDF("COL1","COL2","COL3")
df.cache()
val output = df.columns
  .map(c => (c, df.agg(max(length(df(c)))).as[Int].first()))
  .toSeq.toDF("COLUMN_NAME", "MAX_LENGTH")
output.show()

+-----------+----------+
|COLUMN_NAME|MAX_LENGTH|
+-----------+----------+
|       COL1|         3|
|       COL2|         8|
|       COL3|         6|
+-----------+----------+

I think it's a good idea to cache the input DataFrame df to make the computation faster, since it is scanned once per column.

Md Shihab Uddin
  • Great solution. However, my actual data has nulls and blanks, so `.as[Int]` throws an error, and if I remove the `.as[Int]` it asks for an Encoder. Error: java.lang.UnsupportedOperationException: No Encoder found for org.apache.spark.sql.Row - field (class: "org.apache.spark.sql.Row", name: "_2") - root class: "scala.Tuple2" – A8H1 Jan 21 '19 at 17:45
    Thanks, I'm trying to reproduce the issue you are facing. I've edited my answer and inserted a row with null and blank values, but it does not show any error. Can you please give sample rows so that I can reproduce the issue? – Md Shihab Uddin Jan 21 '19 at 18:17
  • Hi, I simply changed the `.as[Int]` to `.as[String]` and that worked for now – A8H1 Jan 22 '19 at 06:43
    Can you please accept the answer if it helps you ? :) – Md Shihab Uddin Jan 22 '19 at 06:47
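Regarding the null discussion above: Spark's length() returns null for a null input and max() skips nulls, so the aggregation itself succeeds; the trouble comes from decoding a possibly-null result, since a Scala Int cannot hold SQL null. The null handling can be sketched in plain Scala (no Spark; an Option stands in for the nullable length):

```scala
// Null-safe max length per column, mirroring Spark's semantics:
// length(null) is null, and max() ignores nulls.
val columns = Seq("COL1", "COL2", "COL3")
val rows = Seq(
  Seq("abc", "abcd", "abcdef"),
  Seq("a", "BCBDFG", "qddfde"),
  Seq("MN", "1234B678", "sd"),
  Seq(null, "", "sd")            // row with a null and a blank, as in the answer
)
val maxLens = columns.indices.map { i =>
  val lens = rows.flatMap(r => Option(r(i)).map(_.length)) // nulls dropped, like max()
  if (lens.isEmpty) None else Some(lens.max)               // all-null column -> None
}
val report = columns.zip(maxLens)
// report contains (COL1,Some(3)), (COL2,Some(8)), (COL3,Some(6))
```

Note the blank string still counts with length 0, exactly as length('') does in Spark; only nulls are skipped.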

Here is one more way to get the report of column names in vertical format:

scala> val df = Seq(("abc","abcd","abcdef"),("a","BCBDFG","qddfde"),("MN","1234B678","sd")).toDF("COL1","COL2","COL3")
df: org.apache.spark.sql.DataFrame = [COL1: string, COL2: string ... 1 more field]

scala> df.show(false)
+----+--------+------+
|COL1|COL2    |COL3  |
+----+--------+------+
|abc |abcd    |abcdef|
|a   |BCBDFG  |qddfde|
|MN  |1234B678|sd    |
+----+--------+------+

scala> val columns = df.columns
columns: Array[String] = Array(COL1, COL2, COL3)

scala> val df2 = columns.foldLeft(df) { (acc,x) => acc.withColumn(x,length(col(x))) }
df2: org.apache.spark.sql.DataFrame = [COL1: int, COL2: int ... 1 more field]

scala> val df3 = df2.select( columns.map(x => max(col(x))):_* )
df3: org.apache.spark.sql.DataFrame = [max(COL1): int, max(COL2): int ... 1 more field]

scala> df3.show(false)
+---------+---------+---------+
|max(COL1)|max(COL2)|max(COL3)|
+---------+---------+---------+
|3        |8        |6        |
+---------+---------+---------+


scala> df3.flatMap( r => { (0 until r.length).map( i => (columns(i),r.getInt(i)) ) } ).show(false)
+----+---+
|_1  |_2 |
+----+---+
|COL1|3  |
|COL2|8  |
|COL3|6  |
+----+---+



To get the results into a Scala collection, say a Map:

scala> val result = df3.flatMap( r => { (0 until r.length).map( i => (columns(i),r.getInt(i)) ) } ).as[(String,Int)].collect.toMap
result: scala.collection.immutable.Map[String,Int] = Map(COL1 -> 3, COL2 -> 8, COL3 -> 6)

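The flatMap step above just pairs each column name with its value at the same position in the single aggregated row. In plain Scala the pairing is (values copied from the output above):

```scala
// Pair each column name with its value from the one-row max() aggregation,
// then build a Map -- the local equivalent of flatMap + collect.toMap.
val columns = Array("COL1", "COL2", "COL3")
val maxRow = Array(3, 8, 6) // the single row produced by the max() aggregation
val result = columns.zip(maxRow).toMap
// Map(COL1 -> 3, COL2 -> 8, COL3 -> 6)
```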
stack0114106