
I am an Apache Spark/Redis user and recently tried spark-redis for a project. The program generates PySpark DataFrames with approximately 3 million rows, which I am writing to a Redis database using the command

df.write \
  .format("org.apache.spark.sql.redis") \
  .option("table", "person") \
  .option("key.column", "name") \
  .save()

as suggested on the project's GitHub DataFrame documentation page.

However, I am getting inconsistent write times for the same Spark cluster configuration (same number of EC2 instances and the same instance types). Sometimes it is very fast, sometimes far too slow. Is there any way to speed up this process and get consistent write times? I wonder whether it slows down when the database already contains a lot of keys, but that should not be an issue for a hash table, should it?


1 Answer


This could be a problem with your partition strategy.

Check the number of partitions of df before writing and see whether there is a relation between the number of partitions and the execution time.

If so, repartitioning df with a suitable strategy (either to a fixed number of partitions or by a column value) should resolve the problem. A rough sketch of both approaches is below.
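A minimal PySpark sketch of both approaches, assuming the same df, "person" table, and "name" key column as in the question (the partition count of 64 is only an illustrative value to tune for your own cluster):

# Check how many partitions df currently has
print(df.rdd.getNumPartitions())

# Option 1: repartition to a fixed number of partitions
df_repart = df.repartition(64)

# Option 2: repartition by the key column instead
# df_repart = df.repartition("name")

df_repart.write \
    .format("org.apache.spark.sql.redis") \
    .option("table", "person") \
    .option("key.column", "name") \
    .save()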

Hope this helps.

  • Thanks, I will test the repartition strategy and notify you about the results. – holypriest Jan 24 '19 at 10:56
  • 1
    Apparently there is no relation between the number of partitions and the writing times. I tried to double the partitions or to cut them in half and got the same times. But there is a relation between the writing times and the number of keys already saved to the database. When the database is empty, it is pretty fast, and get slower and slower as the database gets filled. – holypriest Jan 30 '19 at 10:37
  • Since Redis is an in-memory data structure store, you might be using up a significant amount of physical memory while writing. Low memory can lead to low performance. Can you describe the nature of your setup? Is it a single node? – Pubudu Sitinamaluwa Jan 30 '19 at 11:38
  • 1
    It is an ElastiCache Clustered Redis, with 2 r4.xlarge nodes. It is not supposed to suffer from memory issues with 2 million keys registered, which is far from its total capacity. With 0 keys I can write a 2-million rows dataframe in 3 minutes. With 2 million keys, the same dataframe takes 50 minutes to get written. It does not seem to be the expected behavior. – holypriest Jan 30 '19 at 21:15