I'm using Spark Structured Streaming to ingest aggregated data with outputMode 'append'; however, the most recent records are not being ingested.
I'm streaming yesterday's records with Databricks Auto Loader. To write to my final table I need to do some aggregation, and since I'm using outputMode = 'append' I'm using a watermark together with a window.
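For context, the Auto Loader read looks roughly like this; the source file format, paths, and schema options below are placeholders rather than my exact settings:

```python
# Sketch of the Auto Loader source (format, paths and options are placeholders)
df = (
    spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")               # assumed source file format
    .option("cloudFiles.schemaLocation", schema_path)  # placeholder path
    .load(source_path)                                 # placeholder path
)
```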
The watermark delay and window duration I set are the following:
df_sum = df.withWatermark('updated_at', "15 minutes").groupBy(F.window('updated_at', "15 minutes"), *group_by_columns).agg(*exprs)
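For reference, `group_by_columns` and `exprs` are along these lines; the column names and aggregations here are only illustrative, not my real ones:

```python
from pyspark.sql import functions as F

# Illustrative only -- the real grouping columns and aggregations differ
group_by_columns = ["customer_id", "product_id"]
exprs = [
    F.sum("amount").alias("total_amount"),
    F.count("*").alias("record_count"),
]
```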
When I display this dataframe, I see all the rows; the oldest and most recent timestamps are 2023-04-11 13:11:30.227 and 2023-04-11 18:14:27.956, respectively.
Finally, I'm writing to a table and reading from it right afterwards:
`df_sum.writeStream.option("checkpointLocation", checkpoint_path).toTable("my_table.autoloader_gold")
spark.readStream.table("my_table.autoloader_gold").display()`
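Spelled out with the output mode explicit, the write is roughly the following; the trigger interval is a placeholder rather than my exact setting:

```python
# Same write with outputMode made explicit; trigger interval is a placeholder
(
    df_sum.writeStream
    .outputMode("append")
    .option("checkpointLocation", checkpoint_path)
    .trigger(processingTime="1 minute")
    .toTable("my_table.autoloader_gold")
)
```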
In the display() of the gold table, only the oldest records show up. What could be happening?
I need my data to be transformed and ingested as soon as it arrives, of course.
Edit: I just sent a new request to trigger the pipeline and ingest a new record. The previously missing record was ingested seconds after my request, but the data from the new request itself was not.
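In case it's relevant, here is a sketch of how the current watermark could be inspected from the query handle (assuming the handle is kept as `q`; `lastProgress` is `None` until at least one micro-batch has completed):

```python
# Keep the handle returned by writeStream to inspect streaming progress
q = (
    df_sum.writeStream
    .option("checkpointLocation", checkpoint_path)
    .toTable("my_table.autoloader_gold")
)

# After at least one micro-batch, the progress report includes event-time info;
# the dict contains a 'watermark' entry when a watermark is defined
if q.lastProgress is not None:
    print(q.lastProgress["eventTime"])
```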