What is the best practice for drawing a time series line plot from a PySpark DataFrame or list that contains missing values?

Question

Thanks to this post, I have a Spark DataFrame and I want to plot one of its columns against the timestamp column. The problem is that the column of interest contains missing values (Null), and I would like to leave those out of the plot rather than draw them.

An excerpt of my DataFrame is shown below; it can be created easily by following this answer:

# +---------------------+-------+
# |      timestamp      |  col1 |
# +---------------------+-------+
# | 2021-05-10 19:48:36 |  714  |
# | 2021-05-10 15:34:26 |  Null |
# | 2021-05-10 14:08:31 |  634  |
# | 2021-05-10 20:29:46 |  8453 |
# | 2021-05-10 19:48:36 |  Null |
# | 2021-05-10 00:20:25 |  3825 |
# +---------------------+-------+
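
For reference, here is a minimal sketch of how the excerpt above could be rebuilt. The names `ROWS`, `parse_rows` and `build_sample_sdf` are made up for illustration, and the schema (a timestamp column plus a nullable integer column) is an assumption based on the table. `build_sample_sdf` expects an existing `SparkSession` and imports PySpark only when called.

```python
from datetime import datetime

# Rows copied from the table above; None stands in for Null.
ROWS = [
    ("2021-05-10 19:48:36", 714),
    ("2021-05-10 15:34:26", None),
    ("2021-05-10 14:08:31", 634),
    ("2021-05-10 20:29:46", 8453),
    ("2021-05-10 19:48:36", None),
    ("2021-05-10 00:20:25", 3825),
]


def parse_rows(rows):
    """Turn the timestamp strings into datetime objects, keeping None values."""
    return [(datetime.strptime(ts, "%Y-%m-%d %H:%M:%S"), val) for ts, val in rows]


def build_sample_sdf(spark):
    """Build the sample Spark DataFrame; `spark` is an existing SparkSession."""
    # Imported lazily so the pure-Python part above works without PySpark.
    from pyspark.sql.types import (
        IntegerType, StructField, StructType, TimestampType,
    )

    # Assumed schema: both columns nullable, col1 stored as an integer.
    schema = StructType([
        StructField("timestamp", TimestampType(), True),
        StructField("col1", IntegerType(), True),
    ])
    return spark.createDataFrame(parse_rows(ROWS), schema=schema)
```

With a running session, `build_sample_sdf(spark).show()` should print a table like the one above.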

Rather than interpolating or imputing the column that contains missing values, as suggested here and here, I would like to keep the gaps and leave the nature of the data untouched in the visualization, as in the following example:

What I tried was to convert the data from PySpark to Pandas using toPandas() and then plot it in Python with seaborn or matplotlib, as offered here, in particular with sns.pointplot() and sns.lineplot().

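from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
# Read the data (INPUT_PATH is the location of the Parquet files)
sdf = spark.read.parquet(INPUT_PATH)
# Convert to Pandas; this collects every row onto the driver
pdf = sdf.toPandas()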

The problem with this trick is that it does not use the power of Spark and its workers. For big data (circa 8M records) it takes a very long time to produce the plot, which I need in order to monitor outliers over the available timestamp window and to reason about the outputs of outlier detection methods.
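
One way to keep the heavy work on the Spark side is sketched below. This is an assumption, not something tested on the real data. It groups rows into fixed-width time buckets and keeps the minimum and maximum of each bucket, so extreme values survive the reduction, and only the small aggregated result is collected with toPandas(). The helper names `bucket_seconds` and `reduce_per_bucket` and the target of about 2000 points are made up for illustration; the column names come from the excerpt above.

```python
import math


def bucket_seconds(start_epoch, end_epoch, max_points):
    """Pick a bucket width (whole seconds) giving roughly `max_points` buckets."""
    span = max(end_epoch - start_epoch, 1)
    return max(1, math.ceil(span / max_points))


def reduce_per_bucket(sdf, ts_col="timestamp", val_col="col1", width_s=60):
    """Aggregate `sdf` on the cluster into one row per time bucket.

    min/max ignore nulls, so a bucket holding only nulls gets null
    v_min/v_max and n_valid == 0.
    """
    # Imported lazily so bucket_seconds can be used without PySpark.
    from pyspark.sql import functions as F

    # Floor each timestamp to the start of its bucket.
    bucket = (
        F.floor(F.unix_timestamp(F.col(ts_col)) / width_s) * width_s
    ).cast("timestamp")

    return (
        sdf.groupBy(bucket.alias("bucket"))
        .agg(
            F.min(val_col).alias("v_min"),
            F.max(val_col).alias("v_max"),
            F.count(val_col).alias("n_valid"),  # counts non-null values only
        )
        .orderBy("bucket")
    )


# Usage sketch (not run here; needs a SparkSession and the `sdf` read above):
# width = bucket_seconds(t0, t1, 2000)   # t0, t1: window bounds in epoch seconds
# small_pdf = reduce_per_bucket(sdf, width_s=width).toPandas()
```

Buckets that contain no rows at all do not appear in the result, so the time axis would still need to be reindexed if those stretches should also show up as gaps.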

Any help with solutions to this problem will be appreciated!

Would it be an option to take a [sample](https://spark.apache.org/docs/3.1.1/api/python/reference/api/pyspark.sql.DataFrame.sample.html) of the data before plotting? — werner, Jun 02 '21 at 19:16
@werner I used it already since I couldn't plot the whole data, but it's not really an option for this case. Since I might skip that event (outlier) when randomly sampled. The aim is to plot all data points despite their gaps due to missing values. I was thinking of replacing missing values with unique fixed values (e.g. 0.123456) and plot those parts as the same colour as background like white `#FFFFFF`. — Mario, Jun 02 '21 at 20:14
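
As an alternative to a sentinel value drawn in the background colour, matplotlib leaves a break in a line wherever the y value is NaN. The sketch below assumes the values have already been collected, for example from a reduced DataFrame. The names `nulls_to_nan` and `plot_with_gaps` are made up for illustration, and matplotlib is imported only when plotting.

```python
import math


def nulls_to_nan(values):
    """Replace None with NaN so the plotted line breaks at missing values."""
    return [math.nan if v is None else float(v) for v in values]


def plot_with_gaps(times, values, ax=None):
    """Draw a time series line that is interrupted at missing values."""
    # Imported lazily so nulls_to_nan works without matplotlib installed.
    import matplotlib.pyplot as plt

    # Sort by time first; the excerpt in the question is not ordered.
    pairs = sorted(zip(times, values), key=lambda p: p[0])
    xs = [t for t, _ in pairs]
    ys = nulls_to_nan([v for _, v in pairs])

    if ax is None:
        _, ax = plt.subplots()
    # Markers keep isolated points visible when both neighbours are missing.
    ax.plot(xs, ys, marker="o")
    return ax
```

A Pandas column that already holds NaN for its missing entries can be passed to `ax.plot` directly, without the conversion step.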

