
I'm reading from a source that has descriptions longer than 256 characters, and I want to write them to Redshift.

According to https://github.com/databricks/spark-redshift#configuring-the-maximum-size-of-string-columns, configuring the maximum size of string columns is only possible in Scala.

According to https://github.com/databricks/spark-redshift/issues/137#issuecomment-165904691, the workaround should be to specify the schema when creating the DataFrame, but I'm not able to get it to work.

How can I specify the schema with varchar(max)?

from pyspark.sql.types import StructType, StructField, StringType

df = ...  # DataFrame loaded from the source

# Plain StringType columns end up as VARCHAR(256) in Redshift by default
schema = StructType([
    StructField('field1', StringType(), True),
    StructField('description', StringType(), True)
])

df = sqlContext.createDataFrame(df.rdd, schema)
– user1217169
1 Answer


Redshift maxlength annotations are passed in the format

{"maxlength":2048}

so this is the metadata you should pass to the StructField constructor:

from pyspark.sql.types import StructField, StringType

StructField("description", StringType(), metadata={"maxlength":2048})

or use the alias method:

from pyspark.sql.functions import col

col("description").alias("description", metadata={"maxlength":2048})

If you use PySpark 2.2 or earlier, please check How to change column metadata in pyspark? for a workaround.

– zero323
  • Setting this as the correct answer, even though I did not get it to work yet, because it answers my question. It should also work in Python now, according to https://docs.databricks.com/spark/latest/data-sources/aws/amazon-redshift.html#additional-configuration-options (Databricks has recently closed-sourced the spark-redshift project) – user1217169 Sep 07 '18 at 10:17