I am using PySpark to read a file from S3, deserialize it from protobuf format, and write it into a Redshift table. Everything else runs fast and well, but the write to Redshift is really slow. The DataFrame is about half a million rows and 100 columns. I followed the tips here, here, and here, but it is still incredibly slow. Here is my code:
from pyspark.sql import Row

# Read the protobuf-encoded values out of the S3 sequence file
rdd = sc.sequenceFile("s3a://bucket_name/file_path").values()

def f(x):
    # Parse one protobuf message and map its fields into a plain dict
    feed = proto_pb2.message()
    feed.ParseFromString(bytes(x))
    row_dict = {descriptor.name: getattr(feed, descriptor.name)
                for descriptor in feed.DESCRIPTOR.fields}
    return row_dict

df = rdd.map(lambda x: Row(**f(x))).toDF()
df = df.coalesce(1)

df.write.format('jdbc').options(url='jdbc:redshift://redshift_url:5439/dev?rewriteBatchedStatements=true',
                                dbtable=table_stage,
                                user=user,
                                password=passwd,
                                batchsize=20000).mode('append').save()
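For reference, this is how the partition count can be checked right before the write (just a diagnostic sketch; df.rdd.getNumPartitions() is the standard Spark call). With coalesce(1) every row goes through a single JDBC connection, and the repartition line is only a hypothetical alternative I have not settled on:

# Diagnostic: how many partitions (and therefore parallel JDBC connections)
# will the write use? With coalesce(1) this prints 1.
print(df.rdd.getNumPartitions())

# Hypothetical alternative: let several executors insert in parallel.
# df = df.repartition(8)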
I have tested uploading just 5 rows of data from my Python interpreter, and it takes a full 8 seconds!
>>> import time
>>> start = time.time()
>>> df_1.write.format('jdbc').options(url='jdbc:redshift://redshift_url:5439/dev?rewriteBatchedStatements=true',
...                                   dbtable=table_stage,
...                                   user=user,
...                                   password=passwd,
...                                   batchsize=20000).mode('append').save()
>>> end = time.time()
>>> print(end - start)
8.918693780899048
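Since Spark evaluates lazily, that timing may also include reading and parsing the 5 rows, not just the JDBC insert. A rough way to separate the two would be something like the sketch below (hypothetical, assuming df_1 is the 5-row DataFrame from above; I have not run this yet):

import time

# Force the DataFrame to be computed and cached first, so the timed
# block below measures (mostly) the JDBC insert itself.
df_1.cache()
df_1.count()

start = time.time()
df_1.write.format('jdbc').options(url='jdbc:redshift://redshift_url:5439/dev?rewriteBatchedStatements=true',
                                  dbtable=table_stage,
                                  user=user,
                                  password=passwd,
                                  batchsize=20000).mode('append').save()
print(time.time() - start)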
Something must be set up incorrectly that I am missing. Please help!