We have 2 Hive tables which are read in spark and joined using a join key, let’s call it user_id. Then, we write this joined dataset to S3 and register it hive as a 3rd table for subsequent tasks to use this joined dataset. One of the other columns in the joined dataset is called keychain_id.
We want to group all the user records belonging to the same keychain_id in the same partition for a reason to avoid shuffles later. So, can I do a repartition(“keychain_id”) before writing to s3 and registering it in Hive , and when I read the same data back from this third table will it still have the same partition grouping (all users belonging to the Same keychain_id in the same partition)? Because trying to avoid doing a repartition(“keychain_id”) every time when reading from this 3rd table. Can you please clarify ? If there is no guarantee that it will retain the same partition grouping while reading, then is there another efficient way this can be done other than caching?