We have a use case to prepare a Spark job that will read data from multiple providers, containing records of users in some arbitrary order, and write them back to files in S3. The constraint is that all of a given user's data must end up in a single file. There are roughly 1 million unique users, and each of them has at most about 10 KB of data. We thought of creating at most 1000 files and letting each file contain the records of about 1000 users.
We're using the Java DataFrame APIs to build the job against Spark 2.4.0. I can't wrap my head around what the most logical way of doing this would be. Should I do a group-by on the user id and then somehow collect rows until I reach 1000 users and then roll over to the next file (if that's even possible), or is there a better way? Any help or a hint in the right direction is much appreciated.
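To make the idea a bit more concrete, the only DataFrame-level translation I could come up with is the sketch below. It is entirely unverified: idvalue is the user-id column in my data, the bucket column and the modulus of 1000 are just assumptions, and spark, pathToRead and pathToWrite are the same as in the update further down. The idea is to hash each user id into one of 1000 buckets so that all of a user's rows land in the same bucket, and then repartition on that bucket so each bucket ends up in one output file.

import static org.apache.spark.sql.functions.*;

// Assign every row to one of 1000 buckets based on a hash of the user id,
// so all rows of a given user share the same bucket value.
Dataset<Row> users = spark.read().option("header", true).csv(pathToRead);
Dataset<Row> bucketed = users.withColumn("bucket", pmod(hash(col("idvalue")), lit(1000)));

// Repartition on the bucket so each bucket (and hence each user) is written
// into exactly one of the ~1000 output files.
bucketed.repartition(1000, col("bucket")).write().parquet(pathToWrite);

I have no idea whether this actually guarantees the file count or whether it's an anti-pattern, hence the question.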
Update:
After following the suggestion from the answer, I went ahead with the following code snippet, but I still saw 200 files being written instead of 1000.
Properties props = PropLoader.getProps("PrepareData.properties");
SparkSession spark = SparkSession.builder().appName("prepareData").master("local[*]")
    .config("fs.s3n.awsAccessKeyId", props.getProperty(Constants.S3_KEY_ID_KEY))
    .config("fs.s3n.awsSecretAccessKey", props.getProperty(Constants.S3_SECERET_ACCESS_KEY))
    .getOrCreate();

// Repartition by the user-id column so each user's rows stay together,
// then try to bring the result down to 1000 output files.
Dataset<Row> dataSet = spark.read().option("header", true).csv(pathToRead);
dataSet.repartition(dataSet.col("idvalue")).coalesce(1000).write().parquet(pathToWrite);
spark.close();
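My only guess so far (completely unverified) is that a column-only repartition falls back to spark.sql.shuffle.partitions, which defaults to 200, and that coalesce can only reduce the partition count, never raise it. If that's the case, I assume something like the line below would be needed before the write, but I haven't confirmed it:

// Unverified assumption: raise the shuffle partition count so that a
// column-only repartition can produce more than the default 200 partitions.
spark.conf().set("spark.sql.shuffle.partitions", "1000");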
But if I pass 100 instead of 1000, I do see 100 files. I then followed the link shared by @Alexandros, and the following code snippet generated more than 20000 files, each within its own directory, and the execution time also shot up dramatically.
dataSet.repartition(1000, dataSet.col("idvalue")).write().partitionBy("idvalue").parquet(pathToWrite);
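The only other variant I can think of at this point (untested, so it may well miss something) is dropping partitionBy entirely and relying on the numbered, column-based repartition alone, on the assumption that hashing idvalue into 1000 partitions keeps each user inside a single file:

// Untested variant: 1000 partitions keyed by the user id, written without
// partitionBy, so there should be at most 1000 files rather than one
// directory per distinct idvalue.
dataSet.repartition(1000, dataSet.col("idvalue")).write().parquet(pathToWrite);

Does that sound like the right direction, or am I still missing something?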