
I just want to check a DataFrame field's type, so I wrote this function:

import scala.reflect.runtime.universe.TypeTag

import org.apache.spark.sql.DataFrame

def colCheck[T: TypeTag](data: DataFrame, colName: String): Boolean = {
  data.schema.exists(f => f.name == colName && f.dataType.isInstanceOf[T])
}

e.g. `data.schema.head` has dataType `StringType` and name `col1`.

I tried this in the spark-shell:

import org.apache.spark.sql.types.{IntegerType, StringType}

val data: DataFrame = spark.createDataFrame(List(("", 1), ("", 2))).toDF("col1", "col2")



data.schema.head.dataType.isInstanceOf[StringType]

> Boolean = true


data.schema.head.dataType.isInstanceOf[IntegerType]

> Boolean = false

colCheck[IntegerType](data, "col1")

> Boolean = true

What I expected is:

colCheck[IntegerType](data, "col1")

> Boolean = false

colCheck[StringType](data, "col1")

> Boolean = false

but I got this:

colCheck[IntegerType](data, "col1")

> Boolean = true

What causes this to happen, and how can I fix it? Thanks very much.

Versions: Spark 2.4.5, Scala 2.11.12.

SummersKing

2 Answers


You need to avoid type erasure as described in this answer:

import scala.reflect.{classTag, ClassTag}
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.types.{DataType, IntegerType, StringType}

def colCheck[T: ClassTag](data: DataFrame, colName: String): Boolean = {
    data.schema.exists(
        f => f.name == colName && classTag[T].runtimeClass.isInstance(f.dataType)
    )
}

val data: DataFrame = spark.createDataFrame(List(("",1),("",2))).toDF("col1","col2")

colCheck[IntegerType](data, "col1")   // false
colCheck[IntegerType](data, "col2")   // true
colCheck[StringType](data, "col1")    // true
colCheck[StringType](data, "col2")    // false
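
To see why the original version always returned true: with an unbounded type parameter, `T` is erased at runtime, so `isInstanceOf[T]` degrades to a check against `Object` and matches any non-null value. A minimal sketch of the effect (`erasedCheck` is an illustrative name, not part of either answer):

// The compiler warns "abstract type T is unchecked since it is eliminated
// by erasure"; the erased test then passes for any non-null value.
def erasedCheck[T](x: Any): Boolean = x.isInstanceOf[T]

erasedCheck[Int]("not an int")   // true, despite the type mismatch

ClassTag sidesteps this by carrying the runtime class along as an implicit value, so `runtimeClass.isInstance` performs a real runtime check.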
mck
  • It worked. What about an ArrayType check? It only works for `colCheck[ArrayType](data,"cola")`; what if I need to check a specific type, e.g. `ArrayType(IntegerType, false)`? – SummersKing Jan 25 '21 at 01:43
  • @SummersKing then you can use the other answer – mck Jan 25 '21 at 08:06

You should just pass in IntegerType as a value parameter instead of a type parameter, and then use == to compare instead of isInstanceOf.

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.types.{DataType, IntegerType}

def colCheck(data: DataFrame, colName: String, tpe: DataType): Boolean =
  // find returns None for a missing column, where data.schema(colName) would throw
  data.schema.find(_.name == colName).exists(_.dataType == tpe)

colCheck(data, "col1", IntegerType)   // false
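
Since the expected type is an ordinary value here, this also covers the ArrayType question from the comment under the first answer: just pass the fully specified type. A sketch, assuming `data` is the frame built above (the column name `cola` and the array construction are illustrative; note that `containsNull` follows the nullability of the source column):

import org.apache.spark.sql.functions.{array, col}
import org.apache.spark.sql.types.{ArrayType, StringType}

// col2 is a non-nullable integer column, so the resulting column has type
// ArrayType(IntegerType, containsNull = false)
val withArray = data.withColumn("cola", array(col("col2")))

colCheck(withArray, "cola", ArrayType(IntegerType, containsNull = false))  // true
colCheck(withArray, "cola", ArrayType(StringType, containsNull = false))   // false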
Jasper-M