I have a use case that seems relatively simple to solve using Spark, but I can't seem to figure out a reliable way to do it.
I have a dataset which contains time series data for various users. All I'm looking to do is:
- partition this dataset by user ID
- sort the time series data for each user, which should by then be contained within a single partition,
- write each partition out as a single CSV file. In the end I'd like one CSV file per user ID.
I tried using the following code snippet, but got surprising results. I do end up with one CSV file per user ID, and some users' time series data comes out sorted, but for many other users it does not.
# repr(ds) = DataFrame[userId: string, timestamp: string, c1: float, c2: float, c3: float, ...]
ds = load_dataset(user_dataset_path)

(ds.repartition("userId")               # hash-partition by userId
   .sortWithinPartitions("timestamp")   # sort each partition by timestamp
   .write
   .partitionBy("userId")               # one output directory per userId
   .option("header", "true")
   .csv(output_path))
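For what it's worth, I believe repartition("userId") uses hash partitioning, so multiple user IDs can land in the same partition. A quick sanity check along these lines (spark_partition_id and countDistinct are built-ins from pyspark.sql.functions; ds is the same DataFrame as above) should show how many distinct users share each partition:

from pyspark.sql.functions import spark_partition_id, countDistinct

(ds.repartition("userId")
   .withColumn("pid", spark_partition_id())        # physical partition each row landed in
   .groupBy("pid")
   .agg(countDistinct("userId").alias("users"))    # distinct users per partition
   .orderBy("pid")
   .show())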
I'm unclear as to why this happens, and I'm not sure how to do this reliably. I'm also not sure whether this is a bug in Spark.
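The only workaround I've thought of so far is to also include userId in the within-partition sort, so that each user's rows stay contiguous and time-ordered even when several users share a hash partition. I haven't verified that this actually fixes the problem, but as a sketch:

(ds.repartition("userId")
   .sortWithinPartitions("userId", "timestamp")    # sort by user first, then time
   .write
   .partitionBy("userId")
   .option("header", "true")
   .csv(output_path))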
I'm using Spark 2.0.2 with Python 2.7.12. Any advice would be very much appreciated!