
I am using PySpark to read a file from S3, deserialize it from protobuf format, and write it into a Redshift table. Everything works well and runs fast, except that the write to Redshift is really slow. The DataFrame is about half a million rows and 100 columns. I followed tips here, here, and here, but it is still incredibly slow. Here is my code:

    from pyspark.sql import Row
    import proto_pb2

    # Read the raw protobuf records from the sequence file on S3
    rdd = sc.sequenceFile("s3a://bucket_name/file_path").values()

    # Deserialize each protobuf message into a dict of field name -> value
    def f(x):
        feed = proto_pb2.message()
        feed.ParseFromString(bytes(x))
        row_dict = {descriptor.name: getattr(feed, descriptor.name) for descriptor in feed.DESCRIPTOR.fields}
        return row_dict

    df = rdd.map(lambda x: Row(**f(x))).toDF()

    df = df.coalesce(1)

    # Write the DataFrame to the Redshift staging table over JDBC
    df.write.format('jdbc').options(url='jdbc:redshift://redshift_url:5439/dev?rewriteBatchedStatements=true',
                                    dbtable=table_stage,
                                    user=user,
                                    password=passwd,
                                    batchsize=20000).mode('append').save()

I have tested uploading just 5 rows of data from my Python interpreter, and it takes a full 8 seconds!
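
(df_1 below is just a 5-row slice of the same DataFrame, e.g.:)

    # a tiny test DataFrame; the exact way it was built doesn't matter
    df_1 = df.limit(5)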

    >>> import time
    >>> start = time.time()
    >>> df_1.write.format('jdbc').options(url='jdbc:redshift://redshift_url:5439/dev?rewriteBatchedStatements=true', 
                             dbtable=table_stage, 
                             user=user, 
                             password=passwd, 
                             batchsize=20000).mode('append').save()
    >>> end = time.time()
    >>> print(end - start)
    8.918693780899048

Something must be set up incorrectly that I am missing. Help, please!
