
I am using Scala and trying to filter a dataset on a string column with a date and time value. I have looked at several posts and tried SimpleDateFormat techniques, to no avail.

My sample data is:

+----------------------+
|my_date_str           |
+----------------------+
|12/11/2018 08:01:55 AM|
|12/11/2018 08:33:22 PM|
|12/13/2018 09:25:28 PM|
|12/17/2018 07:27:36 PM|
+----------------------+

I'd like to keep rows where the time is between 7 PM and 9 PM (the date does not matter, only the time). Of the four rows, I would expect to keep these two:

12/17/2018 07:27:36 PM
12/11/2018 08:33:22 PM

I can hack this together using substring functions, but I imagine there is a better way using to_date or a unix function (I tried converting to seconds with unix_timestamp() and then extracting the time somehow), isolating the time, and checking the hour value.

// Filter down to rows between 7 PM and 9 PM by slicing out the time and the AM/PM marker
my_data.withColumn("hour_str", substring($"my_date_str", 12, 8))
    .filter( (substring($"my_date_str", -2, 2) === "PM") && ($"hour_str" >= "07:00:00") && ($"hour_str" <= "09:00:00") )
    .show(truncate = false)

Too many failed attempts to include here, but these are a couple of the posts I used:
How to convert unix timestamp to date in Spark
How to convert String to date time in Scala?

If it's not clear, the question is: how do I effectively filter a string date column by an hour range?

md2614

1 Answer


You need a "normal" timestamp, not unix_timestamp.

Something like this should work, though I am somewhat rusty on the exact incantations:

df
  .withColumn(
    "hour",
    hour(to_timestamp($"foo", "MM/dd/yyy hh:mm:ss a"))
  )
  .filter($"hour" between (19, 20))
Dima
  • Hmm, I'm trying to get the pattern correct as well. In the conversion, an AM row returns the correct hour but the PM rows are null. – md2614 Apr 08 '21 at 18:54
  • This seems to work correctly: hour(to_timestamp($"foo", "MM/dd/yyyy hh:mm:ss a")) – md2614 Apr 08 '21 at 19:04
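For completeness, here is a minimal end-to-end sketch combining the answer with the four-digit yyyy fix from the comments. The object name HourRangeFilter and the local[*] session are hypothetical scaffolding added for this sketch; the column name my_date_str and the sample data come from the question.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{hour, to_timestamp}

object HourRangeFilter extends App {
  // Hypothetical local session, just to make the sketch self-contained
  val spark = SparkSession.builder().master("local[*]").appName("HourRangeFilter").getOrCreate()
  import spark.implicits._

  val myData = Seq(
    "12/11/2018 08:01:55 AM",
    "12/11/2018 08:33:22 PM",
    "12/13/2018 09:25:28 PM",
    "12/17/2018 07:27:36 PM"
  ).toDF("my_date_str")

  // Parse the string with the corrected four-digit-year pattern, extract the
  // 24-hour clock hour (0-23), and keep hours 19 and 20
  myData
    .withColumn("hour", hour(to_timestamp($"my_date_str", "MM/dd/yyyy hh:mm:ss a")))
    .filter($"hour" between (19, 20))
    .show(truncate = false)

  spark.stop()
}

Note that between(19, 20) is inclusive on both ends, so it keeps 7:00:00 PM through 8:59:59 PM, which matches the two expected rows in the question.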