
I have the following structure in Redshift after running DESCRIBE on a table (all fields are nullable):

a integer
b numeric(18)
c date
d char(3)
e smallint
f char(1)
g varchar(20)
h numeric(11,2)

All data is extracted to S3. Now I want to load the data into a Spark DataFrame, but I need to create a proper schema for this table as well.

What would the Spark schema look like for these fields?

Is this structure correct? (I'm wondering especially about the numeric(11,2), date, and char(1) fields.)

import org.apache.spark.sql.types._

val schema = StructType(
    Array(
        StructField("a", IntegerType, true),
        StructField("b", IntegerType, true),
        StructField("c", StringType, true),
        StructField("d", StringType, true),
        StructField("e", IntegerType, true),
        StructField("f", StringType, true),
        StructField("g", StringType, true),
        StructField("h", IntegerType, true)
    )
)
Joe

1 Answer


You should use:

  • DoubleType or DecimalType for values with a fractional part, like NUMERIC(11,2). Decimal is better in my opinion, as it operates on BigDecimals
  • LongType for very big numbers, like NUMERIC(18). Otherwise the value will not be stored properly
  • DateType for dates - a date can be stored as a string, but if you can, you should choose the more meaningful type
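
Applying these suggestions to the table in the question, the schema might look like this (a sketch; the column names are the question's, and LongType, DateType and DecimalType(11,2) are the changes suggested above):

import org.apache.spark.sql.types._

val schema = StructType(
    Array(
        StructField("a", IntegerType, true),        // integer
        StructField("b", LongType, true),           // numeric(18): up to 18 digits, too big for Int
        StructField("c", DateType, true),           // date
        StructField("d", StringType, true),         // char(3)
        StructField("e", IntegerType, true),        // smallint fits comfortably in an Int
        StructField("f", StringType, true),         // char(1)
        StructField("g", StringType, true),         // varchar(20)
        StructField("h", DecimalType(11, 2), true)  // numeric(11,2): note the precision and scale arguments
    )
)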
T. Gawęda
  • Thanks! Does the DateType need to follow a specific format (or be converted to a specific format)? (e.g. MM/DD/YYYY HH:MM:SS) – Joe Feb 07 '17 at 16:01
  • @Joe You can easily convert a string in any format to a Date (see the sketch after these comments): http://stackoverflow.com/questions/40763796/convert-date-from-string-to-date-format-in-dataframes – T. Gawęda Feb 07 '17 at 16:06
  • At which point should we use LongType instead of IntegerType? (e.g. when a field in the database is defined as Numeric(5), Numeric(10), Numeric(15), ...?) – Joe Feb 07 '17 at 18:48
  • Integer has a maximum value of 2147483647. If you use higher numbers, then you should use LongType. `b` can have 18 digits, so it should be `LongType`; `h` is also too big, but with a fractional part - that's why I recommend DecimalType (or DoubleType if you use Doubles in code - Decimal is safer, as it can hold much bigger numbers) – T. Gawęda Feb 07 '17 at 19:17
  • Thanks. I am getting this error when trying DecimalType (all other types are OK): `:84: error: type mismatch; found: org.apache.spark.sql.types.DecimalType.type, required: org.apache.spark.sql.types.DataType` on `StructField("aff_num", DecimalType, true)` – Joe Feb 07 '17 at 19:32
  • @Joe Try `DecimalType(11,2)` – T. Gawęda Feb 07 '17 at 19:57
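
Following up on the date-format question in the comments, here is a minimal sketch of the conversion approach from the linked answer: read the column as a string first, then parse it into a DateType column. The column name "c" and the MM/dd/yyyy pattern are assumptions for illustration.

import org.apache.spark.sql.functions.{col, to_date, unix_timestamp}

// Assumption: df was loaded with "c" as a StringType column in MM/dd/yyyy format.
// unix_timestamp parses the string with the given pattern; casting to timestamp
// and wrapping in to_date yields a proper DateType column.
val withDate = df.withColumn(
  "c",
  to_date(unix_timestamp(col("c"), "MM/dd/yyyy").cast("timestamp"))
)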