Building on this post, I have a Spark DataFrame and I want to plot one of its columns against the timestamp column. The problem is that the column of interest contains missing values (Null), and I would like not to draw them.
An excerpt of my DataFrame is shown below; it can be created easily by following this answer:
# +---------------------+-------+
# | timestamp | col1 |
# +---------------------+-------+
# | 2021-05-10 19:48:36 | 714 |
# | 2021-05-10 15:34:26 | Null |
# | 2021-05-10 14:08:31 | 634 |
# | 2021-05-10 20:29:46 | 8453 |
# | 2021-05-10 19:48:36 | Null |
# | 2021-05-10 00:20:25 | 3825 |
# +---------------------+-------+
Rather than interpolating or imputing the column that contains missing values, as suggested here and here, I would like to keep the gaps and leave the nature of the data untouched in the visualization, as in the following example:
What I tried was the following trick: convert the data from PySpark to pandas using toPandas() and then plot it in Python with seaborn or matplotlib, as offered here, in particular sns.pointplot() and sns.lineplot().
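# read the data (spark is an existing SparkSession, INPUT_PATH points to the parquet files)
sdf = spark.read.parquet(INPUT_PATH)
# convert to pandas; this collects all rows to the driver
pdf = sdf.toPandas()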
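For reference, here is a minimal sketch of the pandas-side plotting I mean, run on the six-row excerpt above instead of the real data. The hand-built `pdf` is a stand-in for `sdf.toPandas()`, and it assumes Null arrives as NaN in a float column; `plot_with_gaps` is just a hypothetical helper name. It relies on the fact that a matplotlib line plot draws no segment to or from a NaN point, so the gaps stay visible without any imputation.

```python
import pandas as pd

# Hypothetical stand-in for sdf.toPandas(): the six-row excerpt above,
# with Null assumed to arrive as NaN in a float column.
pdf = pd.DataFrame({
    "timestamp": [
        "2021-05-10 19:48:36",
        "2021-05-10 15:34:26",
        "2021-05-10 14:08:31",
        "2021-05-10 20:29:46",
        "2021-05-10 19:48:36",
        "2021-05-10 00:20:25",
    ],
    "col1": [714, None, 634, 8453, None, 3825],
})
pdf["timestamp"] = pd.to_datetime(pdf["timestamp"])
pdf["col1"] = pdf["col1"].astype("float64")

# Sort by time so the line follows the timestamps; a stable sort keeps
# the original order of the two rows that share 19:48:36.
pdf = pdf.sort_values("timestamp", kind="stable").reset_index(drop=True)


def plot_with_gaps(frame):
    """Line plot of col1 over timestamp; NaN values are left as gaps."""
    # Imported here so the data preparation above runs without matplotlib.
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots(figsize=(8, 3))
    # No segment is drawn to or from a NaN point, so nothing is imputed.
    # Markers keep isolated values (with NaN on both sides) visible.
    ax.plot(frame["timestamp"], frame["col1"], marker="o", linewidth=1)
    ax.set_xlabel("timestamp")
    ax.set_ylabel("col1")
    fig.autofmt_xdate()
    return fig, ax
```

Calling `fig, ax = plot_with_gaps(pdf)` and then `fig.savefig("col1.png")` (or `plt.show()` in a notebook) gives the kind of gapped line I am after. This works fine on small data; the issue is only the scale.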
The problem with this trick is that it does not use Spark and the power of its workers. For big data (circa 8M records) it takes a very long time to get the plot and to monitor outliers over the available timestamp window, which I need in order to reason about the outputs of outlier detection methods.
Any help or updated solutions for this problem would be appreciated!