
I have a Spark application that processes log files. However, I would like to save the output to different folders based on the date of each record. I believe I could do that row by row, but wouldn't that be inefficient? I'm thinking I can run a group-by query that groups on just the date and writes each group out to its own folder. How do I do that?

Here's what I have right now, which builds the path from the current date instead of the record's date:

jsonRows.foreachRDD(r => {
  // Output path is derived from the *current* date, not the date in the data
  val parsedFormat = new SimpleDateFormat("yyyy-MM-dd/")
  val parsedDate = parsedFormat.format(new java.util.Date())
  val outputPath = destinationBucket + "/parsed_logs/orc/dt=" + parsedDate

  val jsonDf = sqlSession.read.schema(Schema.schema).json(r)

  val writer = jsonDf.write.mode("append").format("orc").option("compression", "zlib")

  if (environment.equals("local")) {
    writer.save("/tmp/sparrow")
  } else {
    writer.save(outputPath)
  }
})

Here's a sample of the DataFrame:

_ts
2018-01-02:10:10:10
2018-01-02:10:10:10
2018-01-03:10:10:10

I would like to group by date, so I would have two groups:

2018-01-02:10:10:10
2018-01-02:10:10:10

and

2018-01-03:10:10:10

And I would like to save each group to its own folder:

2018-01-02 and 2018-01-03

How can I achieve that?
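Would DataFrameWriter.partitionBy do this in one pass? Here's a rough sketch of what I'm imagining, assuming _ts is a string in the format shown above, so the date is just its first 10 characters (withDate is a name I made up for the intermediate DataFrame):

import org.apache.spark.sql.functions.{col, substring}

jsonRows.foreachRDD(r => {
  val jsonDf = sqlSession.read.schema(Schema.schema).json(r)

  // Derive the partition date from the first 10 characters of _ts,
  // e.g. "2018-01-02:10:10:10" -> "2018-01-02"
  val withDate = jsonDf.withColumn("dt", substring(col("_ts"), 1, 10))

  withDate.write
    .mode("append")
    .format("orc")
    .option("compression", "zlib")
    .partitionBy("dt") // one subfolder per date: .../orc/dt=2018-01-02/
    .save(destinationBucket + "/parsed_logs/orc")
})

If I understand correctly, this would write each date's rows under its own dt=... subdirectory without me having to loop over the groups myself. Is that the right approach?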
