
I have the following structure in Redshift after running DESCRIBE on a table (all fields are nullable):

a integer
b numeric(18)
c date
d char(3)
e smallint
f char(1)
g varchar(20)
h numeric(11,2)

All data is extracted to S3. Now I want to load the data into a Spark DataFrame, but I need to create a proper schema for this table as well.

What would the Spark schema look like for these fields?

Is this structure correct? (I'm wondering especially about the numeric(11,2), date, and char(1) fields.)

import org.apache.spark.sql.types._

val schema = StructType(
    Array(
        StructField("a", IntegerType, true),
        StructField("b", IntegerType, true),
        StructField("c", StringType, true),
        StructField("d", StringType, true),
        StructField("e", IntegerType, true),
        StructField("f", StringType, true),
        StructField("g", StringType, true),
        StructField("h", IntegerType, true)
    )
)
Joe

1 Answer


You should use:

  • DoubleType or DecimalType for values with a fractional part, like NUMERIC(11,2). Decimal is better in my opinion, as it operates on BigDecimals
  • LongType for very big numbers, like NUMERIC(18). Otherwise the value will not be stored properly
  • DateType for dates - a date can be stored as a string, but if you can, you should choose the more meaningful type
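
Applying these suggestions to the table in the question, the schema might look like this (a sketch; the column names are the question's, and LongType, DateType and DecimalType(11,2) are the changes suggested above):

import org.apache.spark.sql.types._

val schema = StructType(
    Array(
        StructField("a", IntegerType, true),        // integer
        StructField("b", LongType, true),           // numeric(18): up to 18 digits, too big for Int
        StructField("c", DateType, true),           // date
        StructField("d", StringType, true),         // char(3)
        StructField("e", IntegerType, true),        // smallint fits comfortably in an Int
        StructField("f", StringType, true),         // char(1)
        StructField("g", StringType, true),         // varchar(20)
        StructField("h", DecimalType(11, 2), true)  // numeric(11,2): note the precision and scale arguments
    )
)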
T. Gawęda
  • Thanks! Does the DateType need to follow a specific format (or be converted to a specific format)? (e.g. MM/DD/YYYY HH:MM:SS) – Joe Feb 07 '17 at 16:01
  • @Joe You can easily convert a string in any format to a Date (see the sketch after these comments): http://stackoverflow.com/questions/40763796/convert-date-from-string-to-date-format-in-dataframes – T. Gawęda Feb 07 '17 at 16:06
  • At which point should we use LongType instead of IntegerType? (e.g. when a field in the database is defined as Numeric(5), Numeric(10), Numeric(15), ...?) – Joe Feb 07 '17 at 18:48
  • Integer has a maximum value of 2147483647. If you use higher numbers, then you should use LongType. `b` can have 18 digits, so it should be `LongType`; `h` is also too big, but with a fractional part - that's why I recommend DecimalType (or DoubleType if you use Doubles in code - Decimal is safer, as it can hold much bigger numbers) – T. Gawęda Feb 07 '17 at 19:17
  • Thanks. I am getting this error when trying DecimalType (all other types are OK): `:84: error: type mismatch; found: org.apache.spark.sql.types.DecimalType.type, required: org.apache.spark.sql.types.DataType` on `StructField("aff_num", DecimalType, true)` – Joe Feb 07 '17 at 19:32
  • @Joe Try `DecimalType(11,2)` – T. Gawęda Feb 07 '17 at 19:57
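
Following up on the date-format question in the comments, here is a minimal sketch of the conversion approach from the linked answer: read the column as a string first, then parse it into a DateType column. The column name "c" and the MM/dd/yyyy pattern are assumptions for illustration.

import org.apache.spark.sql.functions.{col, to_date, unix_timestamp}

// Assumption: df was loaded with "c" as a StringType column in MM/dd/yyyy format.
// unix_timestamp parses the string with the given pattern; casting to timestamp
// and wrapping in to_date yields a proper DateType column.
val withDate = df.withColumn(
  "c",
  to_date(unix_timestamp(col("c"), "MM/dd/yyyy").cast("timestamp"))
)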