
I'm trying to use Spark to pull data from a Hive table and save it in a SQL Server table. An issue I am facing is that some columns are being pulled into the DataFrame with the BYTE datatype. I would like these to be pulled as TINYINT, or INT if TINYINT is not possible.

The basic way I am doing it is this:

val query = "[SQL query]"

val df = spark.sql(query)

df.write.jdbc([connection info])

How can I apply a schema to this process that forces certain data types?

Cam

1 Answer


Changing a Spark DataFrame column from one data type to another can be done using withColumn(), the cast function, selectExpr, or a SQL expression. Note that the type you want to convert to should be a subclass of the DataType class.
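For example, here is a minimal sketch of the withColumn()/cast approach (the column name "year" is only for illustration):

import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.IntegerType

// Replace the existing "year" column with a copy cast to IntegerType
val dfCast = df.withColumn("year", col("year").cast(IntegerType))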

In Spark, DataFrame columns can only be cast to the following types, as these are the subclasses of the DataType class:

ArrayType, BinaryType, BooleanType, CalendarIntervalType, DateType, HiveStringType, MapType, NullType, NumericType, ObjectType, StringType, StructType, TimestampType

Below is an example using a SQL expression:

val df2 = df.selectExpr("cast(year as int) year")
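Applied to the question, one possible sketch is to cast the BYTE column inside the query result before writing to SQL Server (the column names byte_col / other_col, the JDBC URL, and the connection properties below are placeholders, not taken from the question):

import java.util.Properties

// Hypothetical connection details for the SQL Server target
val jdbcUrl = "jdbc:sqlserver://[host];databaseName=[db]"
val connectionProperties = new Properties()

// Cast the BYTE (tinyint) column to INT before writing
val out = spark.sql(query).selectExpr("cast(byte_col as int) as byte_col", "other_col")

out.write.jdbc(jdbcUrl, "[target_table]", connectionProperties)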

A more detailed answer to this question can be found here:

How can I change column types in Spark SQL's DataFrame?

Gabriele Franco
  • Thank you - I ended up doing something similar to save some time... `val df = ss.sql(query).select([cols])`, and putting the casts in the column definitions inside the select statement. – Cam Jul 17 '20 at 14:10
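For reference, a sketch of that select-with-casts pattern (the column names are hypothetical, and ss is the SparkSession variable from the comment above):

import org.apache.spark.sql.functions.col

// Cast each problematic column as part of the select, then write as before
val df = ss.sql(query).select(
  col("byte_col").cast("int"),
  col("other_col")
)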