I am working with big time-series data using PySpark. The data is on the order of 100 GB or more, with millions to billions of rows, and I am new to big data with PySpark. I want to resample (down-sample) the data: the original data is at 10 Hz, with timestamps in milliseconds, and I want to convert it to 1 Hz (one row per second). It would be really helpful if you could give me some ideas. It would also be great if you could recommend documentation or solutions for dealing with huge data in Spark. Below is sample data (DF):
| start_timestamp | end_timestamp | value |
|---|---|---|
| 2020-11-05 03:25:02.088 | 2020-11-05 04:10:19.288 | 0.0 |
| 2020-11-05 04:24:25.288 | 2020-11-05 04:24:25.218 | 0.4375 |
| 2020-11-05 04:24:25.218 | 2020-11-05 04:24:25.318 | 1.0625 |
| 2020-11-05 04:24:25.318 | 2020-11-05 04:24:25.418 | 1.21875 |
| 2020-11-05 04:24:25.418 | 2020-11-05 04:24:25.518 | 1.234375 |
| 2020-11-05 04:24:25.518 | 2020-11-05 04:24:25.618 | 1.265625 |
| 2020-11-05 04:24:25.618 | 2020-11-05 04:24:25.718 | 1.28125 |
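To make the goal concrete, here is the down-sampling I have in mind, sketched in plain Python on a few of the sample rows above (I am assuming the 1 Hz value should be the mean of all 10 Hz samples inside each second; that assumption may not match every use case):

```python
from collections import defaultdict
from datetime import datetime

# A few 10 Hz samples (start_timestamp, value) from the sample data above.
rows = [
    ("2020-11-05 04:24:25.218", 1.0625),
    ("2020-11-05 04:24:25.318", 1.21875),
    ("2020-11-05 04:24:25.418", 1.234375),
    ("2020-11-05 04:24:25.518", 1.265625),
    ("2020-11-05 04:24:25.618", 1.28125),
]

# Bucket each row by its timestamp truncated to the second,
# then average the values inside each bucket -> one row per second (1 Hz).
buckets = defaultdict(list)
for ts, value in rows:
    second = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S.%f").replace(microsecond=0)
    buckets[second].append(value)

downsampled = {sec: sum(vals) / len(vals) for sec, vals in buckets.items()}
for sec, avg in sorted(downsampled.items()):
    print(sec, avg)
```

All five rows fall inside the same second (04:24:25), so this yields a single 1 Hz row.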
I tried the code from this answer: PySpark: how to resample frequencies.
Here is my sample code:
```python
from pyspark.sql.functions import col, min as min_, max as max_, timestamp_seconds

# Step size in seconds: 1 for 1 Hz output (use 60 * 60 * 24 for daily buckets).
day = 1
# Truncate each start_timestamp down to its step boundary.
epoch = (col("start_timestamp").cast("bigint") / day).cast("bigint") * day
with_epoch = distinctDF.withColumn("epoch", epoch)

# Full time range covered by the data.
min_epoch, max_epoch = with_epoch.select(min_("epoch"), max_("epoch")).first()

# Reference grid with one row per second.
ref = spark.range(min_epoch, max_epoch + 1, day).toDF("epoch")

# Left-join the data onto the grid; seconds with no matching row come out null.
(ref
    .join(with_epoch, "epoch", "left")
    .orderBy("epoch")
    .withColumn("start_timestamp_resampled", timestamp_seconds("epoch"))
    .show(15, False))
```
The code runs, but I am not sure whether it is correct. The output looks like below, but it shows nulls in the columns for the seconds that have no matching row.
| epoch | start_timestamp | end_timestamp | value | start_timestamp_resampled |
|---|---|---|---|---|
| 1604546702 | 2020-11-05 03:25:02.088 | 2020-11-05 04:10:19.288 | 0.0 | 2020-11-05 03:25:02 |
| 1604546703 | null | null | null | 2020-11-05 03:25:03 |
| 1604546704 | null | null | null | 2020-11-05 03:25:04 |
| 1604546705 | null | null | null | 2020-11-05 03:25:05 |
| 1604546706 | null | null | null | 2020-11-05 03:25:06 |
| 1604546707 | null | null | null | 2020-11-05 03:25:07 |