I have a Spark application that processes log files, and I would like to save the output to different folders based on the date in each record. I could do that row by row, but that seems inefficient. I'm thinking I could run a group-by query on just the date and write each group out to its own folder. How do I do that?
Here's what I have right now, which uses the current date rather than the date in each record:
jsonRows.foreachRDD(r => {
  // Build a date-partitioned path from today's date, e.g. .../parsed_logs/orc/dt=2018-01-02/
  val parsedFormat = new SimpleDateFormat("yyyy-MM-dd/")
  val parsedDate = parsedFormat.format(new java.util.Date())
  val outputPath = destinationBucket + "/parsed_logs/orc/dt=" + parsedDate
  // Parse the RDD of JSON strings into a DataFrame with a known schema
  val jsonDf = sqlSession.read.schema(Schema.schema).json(r)
  val writer = jsonDf.write.mode("append").format("orc").option("compression", "zlib")
  if (environment.equals("local")) {
    writer.save("/tmp/sparrow")
  } else {
    writer.save(outputPath)
  }
})
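One idea I've been looking at, in case it helps frame the question: derive a date column from _ts and let partitionBy create one folder per date, instead of grouping by hand. A minimal sketch of what I mean (the dt column name is my own choice, and I'm assuming _ts is a string in the format shown in the sample below):

import org.apache.spark.sql.functions.{col, substring}

jsonRows.foreachRDD(r => {
  val jsonDf = sqlSession.read.schema(Schema.schema).json(r)
  // Take the first 10 characters of _ts as the date,
  // e.g. "2018-01-02:10:10:10" -> "2018-01-02"
  val withDate = jsonDf.withColumn("dt", substring(col("_ts"), 1, 10))
  // partitionBy should create one subfolder per distinct dt value,
  // e.g. .../parsed_logs/orc/dt=2018-01-02/
  withDate.write
    .mode("append")
    .format("orc")
    .option("compression", "zlib")
    .partitionBy("dt")
    .save(destinationBucket + "/parsed_logs/orc")
})

I'm not sure whether this is the right approach or how it behaves in a streaming foreachRDD, so corrections are welcome.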
Here's a sample of the dataframe:
_ts
2018-01-02:10:10:10
2018-01-02:10:10:10
2018-01-03:10:10:10
I would like to group by date, so I would have two groups:
2018-01-02:10:10:10
2018-01-02:10:10:10
and
2018-01-03:10:10:10
And I would like to save them separately to two folders, 2018-01-02 and 2018-01-03.
How can I achieve that?
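For reference, the directory layout I'm hoping for would be something like:

parsed_logs/orc/dt=2018-01-02/part-...
parsed_logs/orc/dt=2018-01-03/part-...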