
I just want to check a DataFrame field's type, so I wrote this function:

import scala.reflect.runtime.universe.TypeTag

import org.apache.spark.sql.DataFrame

def colCheck[T: TypeTag](data: DataFrame, colName: String): Boolean = {
  data.schema.exists(f => f.name == colName && f.dataType.isInstanceOf[T])
}

e.g. `data.schema.head` has dataType `StringType` and name `col1`.

I tried this in the spark-shell:

import org.apache.spark.sql.types.{IntegerType, StringType}

val data: DataFrame = spark.createDataFrame(List(("", 1), ("", 2))).toDF("col1", "col2")



data.schema.head.dataType.isInstanceOf[StringType]

> Boolean = true


data.schema.head.dataType.isInstanceOf[IntegerType]

> Boolean = false

colCheck[IntegerType](data, "col1")

> Boolean = true

What I expected is:

colCheck[IntegerType](data, "col1")

> Boolean = false

colCheck[StringType](data, "col1")

> Boolean = false

but I got this:

colCheck[IntegerType](data, "col1")

> Boolean = true

What causes this to happen, and how can I fix it? Thanks very much.

Versions: Spark 2.4.5, Scala 2.11.12.

SummersKing

2 Answers


You need to avoid type erasure as described in this answer:

import scala.reflect.{classTag, ClassTag}
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.types.{DataType, IntegerType, StringType}

def colCheck[T: ClassTag](data: DataFrame, colName: String): Boolean = {
    data.schema.exists(
        f => f.name == colName && classTag[T].runtimeClass.isInstance(f.dataType)
    )
}

val data: DataFrame = spark.createDataFrame(List(("",1),("",2))).toDF("col1","col2")

colCheck[IntegerType](data, "col1")   // false
colCheck[IntegerType](data, "col2")   // true
colCheck[StringType](data, "col1")    // true
colCheck[StringType](data, "col2")    // false
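
To see why the original version always returned true: with an unbounded type parameter, `T` is erased at runtime, so `isInstanceOf[T]` degrades to a check against `Object` and matches any non-null value. A minimal sketch of the effect (`erasedCheck` is an illustrative name, not part of either answer):

// The compiler warns "abstract type T is unchecked since it is eliminated
// by erasure"; the erased test then passes for any non-null value.
def erasedCheck[T](x: Any): Boolean = x.isInstanceOf[T]

erasedCheck[Int]("not an int")   // true, despite the type mismatch

ClassTag sidesteps this by carrying the runtime class along as an implicit value, so `runtimeClass.isInstance` performs a real runtime check.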
mck
  • It worked. What about an ArrayType check? It only works for `colCheck[ArrayType](data,"cola")`; what if I need to check a specific type, e.g. `ArrayType(IntegerType, false)`? – SummersKing Jan 25 '21 at 01:43
  • @SummersKing then you can use the other answer – mck Jan 25 '21 at 08:06

You should just pass in IntegerType as a value parameter instead of a type parameter, and then use == to compare instead of isInstanceOf.

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.types.{DataType, IntegerType}

def colCheck(data: DataFrame, colName: String, tpe: DataType): Boolean =
  // find returns None for a missing column, where data.schema(colName) would throw
  data.schema.find(_.name == colName).exists(_.dataType == tpe)

colCheck(data, "col1", IntegerType)   // false
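
Since the expected type is an ordinary value here, this also covers the ArrayType question from the comment under the first answer: just pass the fully specified type. A sketch, assuming `data` is the frame built above (the column name `cola` and the array construction are illustrative; note that `containsNull` follows the nullability of the source column):

import org.apache.spark.sql.functions.{array, col}
import org.apache.spark.sql.types.{ArrayType, StringType}

// col2 is a non-nullable integer column, so the resulting column has type
// ArrayType(IntegerType, containsNull = false)
val withArray = data.withColumn("cola", array(col("col2")))

colCheck(withArray, "cola", ArrayType(IntegerType, containsNull = false))  // true
colCheck(withArray, "cola", ArrayType(StringType, containsNull = false))   // false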
Jasper-M