
In my script, PySpark's write method takes a DataFrame and writes it to Redshift; however, some DataFrames contain Boolean columns, and the write fails with an error stating that Redshift does not accept the bit data type.

My question is: why does a column that should be boolean end up as bit?

The code:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("data_quality")
    .config("fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider")
    .config("spark.jars", "redshift-jdbc42-2.1.0.3.jar")
    .config("spark.hadoop.fs.s3a.access.key", "key")
    .config("spark.hadoop.fs.s3a.secret.key", "secret_key")
    .config("spark.hadoop.fs.s3a.session.token", "token")
    .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
    .getOrCreate()
)

df = spark.createDataFrame(
    [
        (1, False),  # create your data here, be consistent in the types.
        (2, True),
    ],
    ["id", "column_type_bool"]  # add your column names here
)
df.show()
df.dtypes  # [('id', 'bigint'), ('column_type_bool', 'boolean')]

df.write \
  .format("jdbc") \
  .option("url", f"jdbc:redshift://{url_db}:5439/{db_name}") \
  .option("driver", "com.amazon.redshift.jdbc42.Driver") \
  .option("dbtable", f"{schema}.{tab}") \
  .option("user", user_db) \
  .option("password", pw) \
  .option("tempdir", "s3a://path") \
  .mode("overwrite") \
  .save()

The schema:

root
 |-- namecolumn: boolean (nullable = true)

The error:

Py4JJavaError: An error occurred while calling o113.save.
: com.amazon.redshift.util.RedshiftException: ERROR: Column "nametable.namecolumn" has unsupported type "bit".
    at com.amazon.redshift.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2601)
    at com.amazon.redshift.core.v3.QueryExecutorImpl.processResultsOnThread(QueryExecutorImpl.java:2269)
    at com.amazon.redshift.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:1880)
    at com.amazon.redshift.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:1872)
    at com.amazon.redshift.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:368)
    at com.amazon.redshift.jdbc.RedshiftStatementImpl.executeInternal(RedshiftStatementImpl.java:514)
    at com.amazon.redshift.jdbc.RedshiftStatementImpl.execute(RedshiftStatementImpl.java:435)
    at com.amazon.redshift.jdbc.RedshiftStatementImpl.executeWithFlags(RedshiftStatementImpl.java:376)
    at com.amazon.redshift.jdbc.RedshiftStatementImpl.executeCachedSql(RedshiftStatementImpl.java:362)
    at com.amazon.redshift.jdbc.RedshiftStatementImpl.executeWithFlags(RedshiftStatementImpl.java:339)
    at com.amazon.redshift.jdbc.RedshiftStatementImpl.executeUpdate(RedshiftStatementImpl.java:297)
    at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.executeStatement(JdbcUtils.scala:1026)
    at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.createTable(JdbcUtils.scala:912)
    at org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:80)
  • I see two possibilities here: 1. The problem may be with the name of the column. bool is likely a keyword in Redshift as well, and you may not be allowed to use it as a column name; try changing it to something else. 2. If the first guess is not correct, then you might want to cast the bool column to a string data type before writing to Redshift – Rohit Nimmala Feb 10 '22 at 19:21
  • Thanks for the answer. In this case the column name bool was just illustrative; it's not the one used in the code, I edited it to avoid confusion. The second suggestion does work, but the analysis requires that this data remain Boolean. Anyway, thank you very much for your answer – Paulo Victor Feb 10 '22 at 20:15

1 Answer


This error occurs because Spark's generic JDBC dialect maps Spark's BooleanType to the BIT SQL type in the CREATE TABLE statement it generates, and Redshift does not support BIT (which is exactly what the error message reports). One workaround is to cast the Boolean columns to strings, using the upper function so the values are written as TRUE/FALSE:

from pyspark.sql.functions import col, upper
    
# Replace "col2" with the name of your Boolean column (e.g. "column_type_bool")
df = df.withColumn("col2_str", col("col2").cast("string"))
df = df.withColumn("col2_redshift", upper(col("col2_str")))
df = df.drop("col2").withColumnRenamed("col2_redshift", "col2")

After this, try writing the DataFrame again.
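If the data must remain Boolean in Redshift (as the asker noted in the comments), an alternative to casting is Spark's JDBC `createTableColumnTypes` write option, which overrides the column types used in the CREATE TABLE statement Spark generates. A minimal sketch, assuming the same schema as the question; the `boolean_columns_ddl` helper is my own illustrative code, not part of the original:

```python
# Hypothetical helper: build a createTableColumnTypes value that forces
# every Spark boolean column to Redshift's BOOLEAN type instead of BIT.
def boolean_columns_ddl(dtypes):
    """dtypes: list of (name, type) pairs, as returned by df.dtypes."""
    return ", ".join(
        f"{name} BOOLEAN" for name, dtype in dtypes if dtype == "boolean"
    )

ddl = boolean_columns_ddl([("id", "bigint"), ("column_type_bool", "boolean")])
# ddl is now "column_type_bool BOOLEAN"

# Pass it to the writer (untested sketch -- requires a live Redshift cluster):
# df.write.format("jdbc") \
#     .option("url", f"jdbc:redshift://{url_db}:5439/{db_name}") \
#     .option("driver", "com.amazon.redshift.jdbc42.Driver") \
#     .option("dbtable", f"{schema}.{tab}") \
#     .option("user", user_db) \
#     .option("password", pw) \
#     .option("createTableColumnTypes", ddl) \
#     .mode("overwrite") \
#     .save()
```

With this option, the column stays a genuine BOOLEAN in Redshift, so no cast back is needed on the analysis side.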

  • As it's currently written, your answer is unclear. Please [edit] to add additional details that will help others understand how this addresses the question asked. You can find more information on how to write good answers [in the help center](/help/how-to-answer). – Community May 10 '23 at 16:56