
[Screenshots: timestamp before / timestamp after @pissall's code]

I have a Timestamp column sampled at 0.5 Hz, which results in millions of rows. I want to reduce the data size by keeping timestamps at an hourly granularity, i.e. 24 observations per day. I have already reduced the data size by filtering it by year, month, and day, but as it is still very big, I now want to reduce it to an hourly basis.

I am working on Databricks and using PySpark for this.

I used the following command to reduce my data from years down to a single day:

df = df.filter(df.Timestamp.between('2019-09-03 00:00:00','2019-09-04 00:00:00'))

I would appreciate your help. Thanks

[Screenshot: Timestamp column showing java.util.Gregori...]

Lav Mehta
  • Do you want to aggregate to hours? – pissall Nov 18 '19 at 16:02
  • @pissall Yes, exactly, I want to aggregate the data to hours. – Lav Mehta Nov 18 '19 at 16:47
  • What would the aggregations be? sum/avg/etc? – pissall Nov 18 '19 at 16:49
  • @pissall Not really the sum/avg kind of aggregation. Instead, I want to fetch (aggregate) the values that occur every hour instead of every second. For example, instead of rows like `2019-09-03 00:00:00 132323 -3,54545`, `2019-09-03 00:00:01 xxxx yyyy`, `2019-09-03 00:00:02 aaa aaa` (columns: Time Stamp, Latitude, Longitude), I want values at `2019-09-03 00:00:00`, `2019-09-03 01:00:00`, `2019-09-03 02:00:00`. – Lav Mehta Nov 18 '19 at 17:06

1 Answer


You can replace the minutes and seconds part of your datetime using a UDF. Might not be the best solution, but here you go:

import pyspark.sql.functions as F
from pyspark.sql.types import TimestampType

# Zero out minutes, seconds and microseconds so every row within the same
# hour gets an identical timestamp
date_replace_udf = F.udf(lambda date: date.replace(minute=0, second=0, microsecond=0), TimestampType())

df = df.withColumn("Timestamp", date_replace_udf(F.col("Timestamp")))

Another reference: How to truncate the time on a DateTime object in Python?
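
If you're on Spark 2.3 or later, the built-in date_trunc function can do the same truncation without a Python UDF; a minimal sketch, assuming the Timestamp column from the question:

import pyspark.sql.functions as F

# Truncate each timestamp down to the start of its hour; this runs natively
# in Spark and avoids Python UDF serialization overhead
df = df.withColumn("Timestamp", F.date_trunc("hour", F.col("Timestamp")))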

pissall
  • Thanks for the quick solution, but it is still returning a data frame with a timestamp that changes every second, not every hour. – Lav Mehta Nov 18 '19 at 17:17
  • Can you show the output? Please add it to your question. – pissall Nov 18 '19 at 17:18
  • @LavMehta You need to check the `fixed_date` column. I'm editing my answer; please try again. – pissall Nov 18 '19 at 17:24
  • Thanks for the correction, but now I am getting "java.util.Gregori..." under the Timestamp column. – Lav Mehta Nov 18 '19 at 17:27
  • It's not throwing an error; that's just what I can see in the column. Please check the image I added to the question. – Lav Mehta Nov 18 '19 at 17:32
  • `date_replace_udf = F.udf(lambda date: date.replace(minute=0, second=0, microsecond=0), TimestampType())` Written this way, it fixed the java.util.Gregori problem in the column. – Lav Mehta Nov 18 '19 at 17:39
  • @LavMehta Updated my answer. Please accept and upvote if it helped. – pissall Nov 18 '19 at 17:42
  • The number of rows is still not reduced. Minutes and seconds now show as 00 with a non-zero hour, but I have multiple rows with the same hour: `2019-09-03 04:00:00`, `2019-09-03 04:00:00`. – Lav Mehta Nov 18 '19 at 17:49
  • @LavMehta For that you'll have to do an aggregation like sum/mean/max etc., as sketched below. – pissall Nov 18 '19 at 17:50
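
To actually reduce the row count to 24 per day, group by the truncated hour and aggregate the remaining columns. A minimal sketch, assuming the Latitude and Longitude columns from the question are numeric, and using avg as an example aggregation (sum/max/etc. work the same way):

import pyspark.sql.functions as F

# One row per hour: truncate each timestamp to the hour, then aggregate
# everything else within that hour
hourly = (
    df.withColumn("Hour", F.date_trunc("hour", F.col("Timestamp")))
      .groupBy("Hour")
      .agg(
          F.avg("Latitude").alias("Latitude"),    # example aggregation; swap in first/max/etc. as needed
          F.avg("Longitude").alias("Longitude"),
      )
      .orderBy("Hour")
)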