
I have a Spark application that processes log files. However, I would like to save the output to different folders based on the date of each record. I believe I could do that row by row, but wouldn't that be inefficient? I'm thinking I can run a group-by query that groups on just the date and writes each group out to its own folder. How do I do that?

Here's what I have right now, which builds the path from the current date instead of the record's date:

jsonRows.foreachRDD(r => {
  // Output path is derived from the *current* date, not the date in the data
  val parsedFormat = new SimpleDateFormat("yyyy-MM-dd/")
  val parsedDate = parsedFormat.format(new java.util.Date())
  val outputPath = destinationBucket + "/parsed_logs/orc/dt=" + parsedDate

  val jsonDf = sqlSession.read.schema(Schema.schema).json(r)

  val writer = jsonDf.write.mode("append").format("orc").option("compression", "zlib")

  if (environment.equals("local")) {
    writer.save("/tmp/sparrow")
  } else {
    writer.save(outputPath)
  }
})

Here's a sample of the DataFrame:

_ts
2018-01-02:10:10:10
2018-01-02:10:10:10
2018-01-03:10:10:10

I would like to group by date, so I would have two groups:

2018-01-02:10:10:10
2018-01-02:10:10:10

and

2018-01-03:10:10:10

And I would like to save each group to its own folder:

2018-01-02 and 2018-01-03

How can I achieve that?
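Would DataFrameWriter.partitionBy do this in one pass? Here's a rough sketch of what I'm imagining, assuming _ts is a string in the format shown above, so the date is just its first 10 characters (withDate is a name I made up for the intermediate DataFrame):

import org.apache.spark.sql.functions.{col, substring}

jsonRows.foreachRDD(r => {
  val jsonDf = sqlSession.read.schema(Schema.schema).json(r)

  // Derive the partition date from the first 10 characters of _ts,
  // e.g. "2018-01-02:10:10:10" -> "2018-01-02"
  val withDate = jsonDf.withColumn("dt", substring(col("_ts"), 1, 10))

  withDate.write
    .mode("append")
    .format("orc")
    .option("compression", "zlib")
    .partitionBy("dt") // one subfolder per date: .../orc/dt=2018-01-02/
    .save(destinationBucket + "/parsed_logs/orc")
})

If I understand correctly, this would write each date's rows under its own dt=... subdirectory without me having to loop over the groups myself. Is that the right approach?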
