
[Screenshots: timestamp before / timestamp after @pissall's code]

I have a Timestamp column sampled at 0.5 Hz, which results in millions of rows. I want to reduce the data size by keeping timestamps at an hourly granularity, i.e. 24 observations per day. I have already reduced the data size by filtering it by year, month, and day, but as it is still very big, I now want to reduce it to an hourly basis.

I am working on Databricks and using PySpark for this.

I used the following command to reduce my data from years down to a single day:

df = df.filter(df.Timestamp.between('2019-09-03 00:00:00','2019-09-04 00:00:00'))

I would appreciate your help. Thanks

[Screenshot: Timestamp column showing java.util.Gregori...]

Lav Mehta
  • Do you want to aggregate to hours? – pissall Nov 18 '19 at 16:02
  • @pissall Yes, exactly, I want to aggregate the data to hours. – Lav Mehta Nov 18 '19 at 16:47
  • What would the aggregations be? sum/avg/etc? – pissall Nov 18 '19 at 16:49
  • @pissall Not really the sum/avg kind of aggregation. Instead, I want to fetch (aggregate) the values that occur every hour instead of every second. For example, instead of rows like `2019-09-03 00:00:00 132323 -3,54545`, `2019-09-03 00:00:01 xxxx yyyy`, `2019-09-03 00:00:02 aaa aaa` (columns: Time Stamp, Latitude, Longitude), I want values at `2019-09-03 00:00:00`, `2019-09-03 01:00:00`, `2019-09-03 02:00:00`. – Lav Mehta Nov 18 '19 at 17:06

1 Answer


You can replace the minutes and seconds part of your datetime using a UDF. Might not be the best solution, but here you go:

import pyspark.sql.functions as F
from pyspark.sql.types import TimestampType

# Zero out minutes, seconds and microseconds so every row within the same
# hour gets an identical timestamp
date_replace_udf = F.udf(lambda date: date.replace(minute=0, second=0, microsecond=0), TimestampType())

df = df.withColumn("Timestamp", date_replace_udf(F.col("Timestamp")))

Another reference: How to truncate the time on a DateTime object in Python?
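
If you're on Spark 2.3 or later, the built-in date_trunc function can do the same truncation without a Python UDF; a minimal sketch, assuming the Timestamp column from the question:

import pyspark.sql.functions as F

# Truncate each timestamp down to the start of its hour; this runs natively
# in Spark and avoids Python UDF serialization overhead
df = df.withColumn("Timestamp", F.date_trunc("hour", F.col("Timestamp")))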

pissall
  • Thanks for the quick solution, but it is still returning a data frame with a timestamp that changes every second, not every hour. – Lav Mehta Nov 18 '19 at 17:17
  • Can you show the output? Please add it to your question. – pissall Nov 18 '19 at 17:18
  • @LavMehta You need to check the `fixed_date` column. I'm editing my answer; please try again. – pissall Nov 18 '19 at 17:24
  • Thanks for the correction, but now I am getting "java.util.Gregori..." under the Timestamp column. – Lav Mehta Nov 18 '19 at 17:27
  • It's not throwing an error; that's just what I can see in the column. Please check the image I added to the question. – Lav Mehta Nov 18 '19 at 17:32
  • `date_replace_udf = F.udf(lambda date: date.replace(minute=0, second=0, microsecond=0), TimestampType())` Written this way, it fixed the java.util.Gregori problem in the column. – Lav Mehta Nov 18 '19 at 17:39
  • @LavMehta Updated my answer. Please accept and upvote if it helped. – pissall Nov 18 '19 at 17:42
  • The number of rows is still not reduced. Minutes and seconds now show as 00 with a non-zero hour, but I have multiple rows with the same hour: `2019-09-03 04:00:00`, `2019-09-03 04:00:00`. – Lav Mehta Nov 18 '19 at 17:49
  • @LavMehta For that you'll have to do an aggregation like sum/mean/max etc., as sketched below. – pissall Nov 18 '19 at 17:50
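
To actually reduce the row count to 24 per day, group by the truncated hour and aggregate the remaining columns. A minimal sketch, assuming the Latitude and Longitude columns from the question are numeric, and using avg as an example aggregation (sum/max/etc. work the same way):

import pyspark.sql.functions as F

# One row per hour: truncate each timestamp to the hour, then aggregate
# everything else within that hour
hourly = (
    df.withColumn("Hour", F.date_trunc("hour", F.col("Timestamp")))
      .groupBy("Hour")
      .agg(
          F.avg("Latitude").alias("Latitude"),    # example aggregation; swap in first/max/etc. as needed
          F.avg("Longitude").alias("Longitude"),
      )
      .orderBy("Hour")
)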