I am trying to store spaces in a varchar column in Redshift.

My data comes in CSV format and looks like this:

"id","first_name","last_name","doj","address"
"A1111","B1111","C1111","D111","E111"
"A2222","B22222",""," ","E22"
"A3333","  ","C333","D333","E3333"

I store it in S3, later read it via Spark in a Glue job, and write it to Redshift as is.

My code:

from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
import pyspark.sql.functions as f
import sys

sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)
spark = glueContext.spark_session

args = getResolvedOptions(sys.argv, ['JOB_NAME'])
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# Read the CSV from S3; without inferSchema every column is read as a string
df = spark.read\
    .format("csv")\
    .option("header", True)\
    .load("s3://bucket/file.csv")

df.show()

# Pull the JDBC details from the Glue connection
redshift_conn = glueContext.extract_jdbc_conf('...')
redshift_connection_options = {
    "url": redshift_conn['url'],
    "user": redshift_conn['user'],
    "password": redshift_conn['password'],
    "database": "...",
    "dbtable": "...",
    "redshiftTmpDir": "..."
}

# Convert to a DynamicFrame and write to Redshift as is
finalDF = DynamicFrame.fromDF(df, glueContext, "finalDF")

glueContext.write_dynamic_frame_from_jdbc_conf(
    frame=finalDF,
    catalog_connection="...",
    connection_options=redshift_connection_options)

job.commit()
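
To narrow down where the trimming happens, a quick length check right after the read (a minimal sketch, reusing the df and the f alias from the script above) shows whether the spaces survive the Spark read itself:

# Sketch: check whether whitespace survives the Spark read
df.select(
    f.col("first_name"),
    f.length(f.col("first_name")).alias("first_name_len"),
    f.col("doj"),
    f.length(f.col("doj")).alias("doj_len"),
).show(truncate=False)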

When I run the above CSV through the script, the whitespace-only values (the " " in doj and the "  " in first_name) end up trimmed in Redshift.

I have tried running the script on Glue 3.0 and 4.0; the result is the same.
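
For completeness: Spark's CSV reader has ignoreLeadingWhiteSpace and ignoreTrailingWhiteSpace options (which, as far as I know, default to false on read), so pinning them explicitly would look like the sketch below; I am not sure whether the trimming happens at this stage or during the Redshift write.

# Sketch: set the CSV reader's whitespace handling explicitly
df = spark.read\
    .format("csv")\
    .option("header", True)\
    .option("ignoreLeadingWhiteSpace", False)\
    .option("ignoreTrailingWhiteSpace", False)\
    .load("s3://bucket/file.csv")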

Is there a way to preserve these spaces end to end?
