
I am new to Spark with Scala and I have the following situation. I have a table "TEST_TABLE" on the cluster (it can be a Hive table), and I convert it to a DataFrame as:

scala> val testDF = spark.sql("select * from TEST_TABLE limit 10")

Now the DF can be viewed as

scala> testDF.show()

COL1|COL2|COL3  
----------------
abc|abcd|abcdef 
a|BCBDFG|qddfde 
MN|1234B678|sd

I want an output like below

COLUMN_NAME|MAX_LENGTH
       COL1|3
       COL2|8
       COL3|6

Is this feasible in Spark Scala?

A8H1

3 Answers


Plain and simple:

import org.apache.spark.sql.functions._

val df = spark.table("TEST_TABLE")
// One aggregation job computes max(length(...)) for every column at once
df.select(df.columns.map(c => max(length(col(c)))): _*).show()
    can you please convert this to PySpark as well? Thanks – BadBoy777 Feb 15 '20 at 04:39
  • @BadBoy777 https://stackoverflow.com/questions/64656561/pyspark-max-string-length-for-each-column-in-the-dataframe/64675496#64675496 I've added pyspark code for a different question. – Ronak Jain Dec 26 '22 at 05:33
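The aggregation above yields a single row whose values are the per-column maxima; to reach the vertical COLUMN_NAME|MAX_LENGTH shape the question asks for, each column name just needs to be paired with its maximum. The same per-column max-length logic can be sketched in plain Scala (no Spark needed; sample data copied from the question):

```scala
// Plain-Scala sketch of what max(length(col(c))) computes per column in Spark.
val columns = Seq("COL1", "COL2", "COL3")
val rows = Seq(
  Seq("abc", "abcd", "abcdef"),
  Seq("a", "BCBDFG", "qddfde"),
  Seq("MN", "1234B678", "sd")
)
// For each column index, take the max string length across all rows
val maxLens = columns.indices.map(i => rows.map(r => r(i).length).max)
// Pair each column name with its maximum length
val report = columns.zip(maxLens)
// report contains (COL1,3), (COL2,8), (COL3,6)
```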

You can try the following:

import org.apache.spark.sql.functions.{length, max}
import spark.implicits._

val df = Seq(("abc","abcd","abcdef"),
          ("a","BCBDFG","qddfde"),
          ("MN","1234B678","sd"),
          (null,"","sd")).toDF("COL1","COL2","COL3")
df.cache()
val output = df.columns
  .map(c => (c, df.agg(max(length(df(c)))).as[Int].first()))
  .toSeq.toDF("COLUMN_NAME", "MAX_LENGTH")
output.show()

+-----------+----------+
|COLUMN_NAME|MAX_LENGTH|
+-----------+----------+
|       COL1|         3|
|       COL2|         8|
|       COL3|         6|
+-----------+----------+

I think it's a good idea to cache the input DataFrame df to make the computation faster, since it is scanned once per column.

Md Shihab Uddin
  • Great solution. However, my actual data has nulls and blanks, so `.as[Int]` throws an error, and if I remove the `.as[Int]` it asks for an Encoder. Error: java.lang.UnsupportedOperationException: No Encoder found for org.apache.spark.sql.Row - field (class: "org.apache.spark.sql.Row", name: "_2") - root class: "scala.Tuple2" – A8H1 Jan 21 '19 at 17:45
    Thanks, I'm trying to reproduce the issue you are facing. I've edited my answer and inserted a row with null and blank values, but it does not show any error. Can you please give sample rows so that I can reproduce the issue? – Md Shihab Uddin Jan 21 '19 at 18:17
  • Hi, I simply changed the `.as[Int]` to `.as[String]` and that worked for now – A8H1 Jan 22 '19 at 06:43
    Can you please accept the answer if it helps you ? :) – Md Shihab Uddin Jan 22 '19 at 06:47
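Regarding the null discussion above: Spark's length() returns null for a null input and max() skips nulls, so the aggregation itself succeeds; the trouble comes from decoding a possibly-null result, since a Scala Int cannot hold SQL null. The null handling can be sketched in plain Scala (no Spark; an Option stands in for the nullable length):

```scala
// Null-safe max length per column, mirroring Spark's semantics:
// length(null) is null, and max() ignores nulls.
val columns = Seq("COL1", "COL2", "COL3")
val rows = Seq(
  Seq("abc", "abcd", "abcdef"),
  Seq("a", "BCBDFG", "qddfde"),
  Seq("MN", "1234B678", "sd"),
  Seq(null, "", "sd")            // row with a null and a blank, as in the answer
)
val maxLens = columns.indices.map { i =>
  val lens = rows.flatMap(r => Option(r(i)).map(_.length)) // nulls dropped, like max()
  if (lens.isEmpty) None else Some(lens.max)               // all-null column -> None
}
val report = columns.zip(maxLens)
// report contains (COL1,Some(3)), (COL2,Some(8)), (COL3,Some(6))
```

Note the blank string still counts with length 0, exactly as length('') does in Spark; only nulls are skipped.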

Here is one more way to get the report of column names in vertical format:

scala> val df = Seq(("abc","abcd","abcdef"),("a","BCBDFG","qddfde"),("MN","1234B678","sd")).toDF("COL1","COL2","COL3")
df: org.apache.spark.sql.DataFrame = [COL1: string, COL2: string ... 1 more field]

scala> df.show(false)
+----+--------+------+
|COL1|COL2    |COL3  |
+----+--------+------+
|abc |abcd    |abcdef|
|a   |BCBDFG  |qddfde|
|MN  |1234B678|sd    |
+----+--------+------+

scala> val columns = df.columns
columns: Array[String] = Array(COL1, COL2, COL3)

scala> val df2 = columns.foldLeft(df) { (acc,x) => acc.withColumn(x,length(col(x))) }
df2: org.apache.spark.sql.DataFrame = [COL1: int, COL2: int ... 1 more field]

scala> val df3 = df2.select( columns.map(x => max(col(x))):_* )
df3: org.apache.spark.sql.DataFrame = [max(COL1): int, max(COL2): int ... 1 more field]

scala> df3.show(false)
+---------+---------+---------+
|max(COL1)|max(COL2)|max(COL3)|
+---------+---------+---------+
|3        |8        |6        |
+---------+---------+---------+


scala> df3.flatMap( r => { (0 until r.length).map( i => (columns(i),r.getInt(i)) ) } ).show(false)
+----+---+
|_1  |_2 |
+----+---+
|COL1|3  |
|COL2|8  |
|COL3|6  |
+----+---+



To get the results into a Scala collection, say a Map:

scala> val result = df3.flatMap( r => { (0 until r.length).map( i => (columns(i),r.getInt(i)) ) } ).as[(String,Int)].collect.toMap
result: scala.collection.immutable.Map[String,Int] = Map(COL1 -> 3, COL2 -> 8, COL3 -> 6)

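The flatMap step above just pairs each column name with its value at the same position in the single aggregated row. In plain Scala the pairing is (values copied from the output above):

```scala
// Pair each column name with its value from the one-row max() aggregation,
// then build a Map -- the local equivalent of flatMap + collect.toMap.
val columns = Array("COL1", "COL2", "COL3")
val maxRow = Array(3, 8, 6) // the single row produced by the max() aggregation
val result = columns.zip(maxRow).toMap
// Map(COL1 -> 3, COL2 -> 8, COL3 -> 6)
```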
stack0114106