
I'm reading from a source that has descriptions longer than 256 characters, and I want to write them to Redshift.

According to https://github.com/databricks/spark-redshift#configuring-the-maximum-size-of-string-columns, configuring the maximum size of string columns is only possible in Scala.

According to https://github.com/databricks/spark-redshift/issues/137#issuecomment-165904691, the workaround should be to specify the schema when creating the DataFrame, but I'm not able to get it to work.

How can I specify the schema with varchar(max)?

from pyspark.sql.types import StructType, StructField, StringType

df = ...  # DataFrame loaded from the source

# Plain StringType columns end up as VARCHAR(256) in Redshift by default
schema = StructType([
    StructField('field1', StringType(), True),
    StructField('description', StringType(), True)
])

df = sqlContext.createDataFrame(df.rdd, schema)
– user1217169
1 Answer


Redshift maxlength annotations are passed in the format

{"maxlength":2048}

so this is the metadata you should pass to the StructField constructor:

from pyspark.sql.types import StructField, StringType

StructField("description", StringType(), metadata={"maxlength":2048})

or use the alias method:

from pyspark.sql.functions import col

col("description").alias("description", metadata={"maxlength":2048})

If you use PySpark 2.2 or earlier, please check How to change column metadata in pyspark? for a workaround.

– zero323
  • Setting this as the correct answer, even though I did not get it to work yet, because it answers my question. It should also work in Python now, according to https://docs.databricks.com/spark/latest/data-sources/aws/amazon-redshift.html#additional-configuration-options (Databricks has recently closed-sourced the spark-redshift project) – user1217169 Sep 07 '18 at 10:17