I'm trying to improve a process in Spark SQL. I have two batch processes where the output of the first is the input of the second, and I need to keep them split.
In my first process I have a table that Spark SQL has partitioned by a key. If I persist it to a datastore, Spark loses track of the hash partitioning used for that table. Later I need to load this data in the other process and join it with other data, and the join key for the data loaded from that process is the same as the previous one. In that case Spark loads the data but, lacking any knowledge of the hashing used to persist it, re-shuffles it into the expected Spark SQL partitions. Since the number of SQL partitions is the same in both processes and the key is the same as well, I think this last shuffle is avoidable, but I don't know how.
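For example, this is roughly how I can see the extra shuffle in the second process (other here stands for any DataFrame in that process keyed on the same column):

// second process: reload the persisted data and join on the same key
val loaded = sparkSession.read.parquet("/home/data")
val joined = loaded.join(other, Seq("key"))
// explain() shows an Exchange (shuffle) on the loaded side,
// even though the data was written already hash-partitioned by "key"
joined.explain()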
To summarize, I want a way to persist the data to an HDFS datastore while preserving the HashPartitioning that Spark SQL applied by a key, so that subsequent reads can avoid that first shuffle. In fewer words: I want to read a partitioned table while keeping track of its partition key, to avoid shuffling.
In pseudocode, this is what I want to do:
import org.apache.spark.sql.functions.col

val path = "/home/data"
// first process: partition by the key, then persist
ds.repartition(col("key")).write.parquet(path)

// in the other Spark SQL process
sparkSession.read.parquet(path).repartition(col("key"))
// I know I need this last repartition,
// but how can I make it as efficient as possible?
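Would something like bucketing cover this? A minimal sketch of that idea, assuming a Hive-compatible metastore is available (the table name and the bucket count of 200 are placeholders I made up):

// first process: persist bucketed by the join key
ds.write
  .bucketBy(200, "key")          // placeholder bucket count
  .sortBy("key")
  .saveAsTable("data_bucketed")  // placeholder name; needs a metastore, not a plain path

// in the other Spark SQL process
val loaded = sparkSession.table("data_bucketed")
loaded.join(other, Seq("key")).explain()
// would the Exchange on the loaded side disappear here?

My assumption is that the bucket count would play the role of the shared partition count mentioned above, but I'm not sure this is the right approach.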