Questions tagged [spark-window-function]
31 questions
5 votes, 1 answer
Difference Between Window function and OrderBy in Spark
I have code whose goal is to take the 10M oldest records out of 1.5B records.
I tried to do it with orderBy and it never finished, and then I tried to do it with a window function and it finished after 15 min.
I understood that with orderBy every…
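A minimal sketch of the two approaches (the column name created_at is an assumption; the question doesn't show the schema). orderBy followed by a limit lets Spark plan a TakeOrderedAndProject, which keeps only each task's local top-N instead of doing a full global sort, while an un-partitioned window funnels every row through a single task to be ranked:

from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.range(1_000_000).withColumn("created_at", F.rand())  # toy stand-in

# orderBy + limit: each task keeps only its local top-N, no full sort
oldest = df.orderBy("created_at").limit(10)

# row_number over an un-partitioned window shuffles all rows to one task,
# which is usually the slower plan unless partitionBy() can be added
w = Window.orderBy("created_at")
ranked = df.withColumn("rn", F.row_number().over(w)).where(F.col("rn") <= 10)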

dasilva555 · 93
3 votes, 3 answers
How to calculate moving median in DataFrame?
Is there a way to calculate a moving median for an attribute in a Spark DataFrame?
I was hoping it would be possible to calculate a moving median using a window function (by defining a window using rowsBetween(0,10)), but there is no functionality to…
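Since Spark 3.1, the aggregate percentile_approx can run over a window frame, which gives an approximate moving median without a UDF. A sketch assuming columns id, ts and value:

from pyspark.sql import Window
from pyspark.sql import functions as F

# frame of the current row plus the next 10 rows, as in the question
w = Window.partitionBy("id").orderBy("ts").rowsBetween(0, 10)

# percentile_approx is an aggregate, so it can be used as a window function
df = df.withColumn("moving_median", F.percentile_approx("value", 0.5).over(w))

Note that percentile_approx is approximate by design; an exact median over a frame generally needs collect_list plus explicit sorting.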

Santosh Kumar · 761
2 votes, 1 answer
PySpark group by with rolling window
Suppose I have a table with three columns: dt, id and value.
df_tmp = spark.createDataFrame([('2023-01-01', 1001, 5),
                                ('2023-01-15', 1001, 3),
                                ('2023-02-10', 1001, 1),
                                …
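A sketch of one common answer, assuming a 30-day trailing window per id (the window length isn't shown in the excerpt). rangeBetween compares values of the ordering expression, so ordering by days-since-epoch turns the frame into a day-based range:

from pyspark.sql import Window
from pyspark.sql import functions as F

df = df_tmp.withColumn("dt", F.to_date("dt"))

# order by days since epoch so rangeBetween counts calendar days
days = F.datediff(F.col("dt"), F.lit("1970-01-01"))
w = Window.partitionBy("id").orderBy(days).rangeBetween(-29, 0)

rolled = df.withColumn("rolling_sum", F.sum("value").over(w))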

Abhishek Parab · 215
2 votes, 1 answer
Spark - Calculating running sum with a threshold
I have a use-case where I need to compute a running sum over a partition such that the running sum does not exceed a certain threshold.
For example:
// Input dataset
| id | created_on | value | running_sum | threshold |
| -- | ----------- | ----- |…
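A plain window sum gives the running total, but a sum that must stop or reset at the threshold can't be expressed as a window frame, since a frame can't read its own previous output. A sketch using the column names from the excerpt:

from pyspark.sql import Window
from pyspark.sql import functions as F

w = (Window.partitionBy("id").orderBy("created_on")
     .rowsBetween(Window.unboundedPreceding, Window.currentRow))

with_sum = df.withColumn("running_sum", F.sum("value").over(w))

# keep rows only while the cumulative sum stays under the threshold;
# a sum that *resets* when the cap is hit needs per-group sequential
# logic instead (e.g. applyInPandas)
capped = with_sum.where(F.col("running_sum") <= F.col("threshold"))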

sanketd617 · 809
2 votes, 1 answer
Compute rolling percentiles in PySpark
I have a dataframe with dates, an ID (let's say of a city) and two columns of temperatures (in my real dataframe I have a dozen columns to compute).
I want to "rank" those temperatures for a given window. I want this ranking to be scaled from 0 (the…

virgilus · 141
1 vote, 0 answers
Spark: How to get a value for each day in an interval?
I have a table with values and the date from which each value is valid:
param | validfrom | value
param1 | 01-01-2022 | 1
param2 | 03-01-2022 | 2
param1 | 05-01-2022 | 11
param1 | 07-01-2022 | 1
I need to get values of each parameter on each…
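One common approach (a sketch; the date format and the end date of the last interval are assumptions): derive each interval's end from the next validfrom of the same param with lead, then explode a date sequence into one row per day:

from pyspark.sql import Window
from pyspark.sql import functions as F

df = df.withColumn("validfrom", F.to_date("validfrom", "dd-MM-yyyy"))

# close each interval the day before the same param's next validfrom;
# the fallback end date for the open interval is an assumption
w = Window.partitionBy("param").orderBy("validfrom")
df = df.withColumn(
    "validto",
    F.coalesce(F.date_sub(F.lead("validfrom").over(w), 1),
               F.to_date(F.lit("07-01-2022"), "dd-MM-yyyy")))

# one row per parameter per day
daily = df.withColumn("day", F.explode(F.sequence("validfrom", "validto")))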

Alexander Lopatin · 560
1 vote, 2 answers
PySpark - calculate median value with a sliding time window
I have the following data frame in…
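A sketch of a time-based frame, assuming columns id, ts and value and a 7-day window: rangeBetween needs a numeric ordering key, so epoch seconds stand in for the timestamp, and percentile_approx (Spark 3.1+) approximates the median over the frame:

from pyspark.sql import Window
from pyspark.sql import functions as F

# 7-day trailing frame expressed in seconds
w = (Window.partitionBy("id")
           .orderBy(F.unix_timestamp("ts"))
           .rangeBetween(-7 * 86400, 0))

df = df.withColumn("median_7d", F.percentile_approx("value", 0.5).over(w))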

Manuel · 35
1 vote, 1 answer
Spark Window Function Null Skew
Recently I've encountered an issue running one of our PySpark jobs. While analyzing the stages in the Spark UI, I noticed that the longest-running stage takes 1.2 hours out of the 2.5 hours it takes for the entire process to…
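Window skew with nulls usually means every row whose partition key is null hashes to the same task. If the null-key rows don't need the window result, one mitigation is to rank only the non-null rows and union the rest back (key, ts and rn are stand-in names):

from pyspark.sql import Window
from pyspark.sql import functions as F

# null partition keys all land in one task and dominate the stage
nulls = df.where(F.col("key").isNull()).withColumn("rn", F.lit(None).cast("int"))
rest = df.where(F.col("key").isNotNull())

w = Window.partitionBy("key").orderBy("ts")
result = rest.withColumn("rn", F.row_number().over(w)).unionByName(nulls)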

evyamiz · 164
1 vote, 2 answers
How to count consecutive days an event happens?
I need to calculate the number of consecutive days, counting backwards from today (2022-01-04), that a client logged in to my application. I need to use PySpark due to the size of my database.
Input
Name | Date
John | 2022-01-01
John | …
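The usual gaps-and-islands trick (a sketch; assumes Date is a date column): subtracting a running row number from the date yields a constant within each run of consecutive days, which can then be grouped:

from pyspark.sql import Window
from pyspark.sql import functions as F

w = Window.partitionBy("Name").orderBy("Date")

# date minus row number is constant within a streak of consecutive days
streaks = (df.withColumn("rn", F.row_number().over(w))
             .withColumn("grp", F.expr("date_sub(Date, rn)"))
             .groupBy("Name", "grp")
             .agg(F.count("*").alias("consecutive_days"),
                  F.max("Date").alias("streak_end")))

# keep only the streak that ends today to count back from 2022-01-04
today = streaks.where(F.col("streak_end") == F.lit("2022-01-04"))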

karek77 · 31
1 vote, 1 answer
How to run a user-defined function over a window in a Spark dataframe?
I am trying to detect outliers in my Spark dataframe.
Below is a data sample.
pressure | Timestamp
358.64 | 2022-01-01 00:00:00
354.98 | 2022-01-01 00:10:00
350.34 | 2022-01-01 00:20:00
429.69 | 2022-01-01 00:30:00
420.41 | 2022-01-01…
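A UDF is often avoidable here, since built-in aggregates run over window frames. A sketch that flags readings more than 2 sigma from a trailing mean (the frame size and threshold are assumptions, and the un-partitioned window puts all rows in one task):

from pyspark.sql import Window
from pyspark.sql import functions as F

# trailing frame of 7 readings, including the current one
w = Window.orderBy("Timestamp").rowsBetween(-6, 0)

flagged = (df.withColumn("mu", F.avg("pressure").over(w))
             .withColumn("sigma", F.stddev("pressure").over(w))
             .withColumn("is_outlier",
                         F.abs(F.col("pressure") - F.col("mu")) > 2 * F.col("sigma")))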

FlashBang · 33
1 vote, 1 answer
Compare consecutive rows and extract words (excluding the subsets) in Spark
I am working on a Spark dataframe. The input dataframe looks like the one below (Table 1). I need to write logic to get the keywords with maximum length for each session id; multiple keywords can be part of the output for each session id.…
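One possible reading of "excluding the subsets" (a sketch only; session_id and keyword are assumed names, and general substring subsets would need a pairwise self-join rather than adjacent-row comparison): sort each session's keywords and drop one when the next keyword contains it:

from pyspark.sql import Window
from pyspark.sql import functions as F

w = Window.partitionBy("session_id").orderBy("keyword")
kept = (df.withColumn("next_kw", F.lead("keyword").over(w))
          .where(F.col("next_kw").isNull() |
                 ~F.col("next_kw").contains(F.col("keyword"))))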

Abhi Sinha · 13
0 votes, 1 answer
Add end of month column dynamically to Spark DataFrame
I have a PySpark DataFrame as follows.
I need to add the EOM column to all the null values for each id dynamically, based on the last non-null EOM value, and it should be continuous.
My output dataframe looks like this.
I have tried this logic:
df.where("EOM…

code_bug · 355
0 votes, 0 answers
Spark Structured Streaming not ingesting latest records with outputMode append
I'm using Spark Structured Streaming to ingest aggregated data using outputMode append; however, the most recent records are not being ingested.
I'm ingesting yesterday's records as a stream using Databricks Auto Loader.
To write to my final table,…
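Worth checking the watermark semantics here: in append mode, a windowed aggregate is emitted only once the watermark passes the end of its window, so the newest, still-open window is held back and looks like missing records. A sketch with assumed names and paths (stream is an existing streaming DataFrame):

from pyspark.sql import functions as F

agg = (stream.withWatermark("event_time", "30 minutes")
             .groupBy(F.window("event_time", "1 hour"))
             .agg(F.count("*").alias("cnt")))

# rows for a given hour appear only after the watermark passes that hour's end
query = (agg.writeStream
            .outputMode("append")
            .option("checkpointLocation", "/tmp/checkpoints/agg")
            .toTable("final_table"))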
0 votes, 0 answers
In PySpark (or SQL), can I use the value calculated in the previous observation in the current observation? (row-wise calculation, like SAS RETAIN)
I want to go through a table row by row, using the value calculated in the previous row in the current row. It seems a window function could do this.
from pyspark.sql import SparkSession
from pyspark.sql import Window
import…
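Window functions only see input columns, never their own earlier output, so a true RETAIN-style recursion needs per-group sequential code; one option is applyInPandas (Spark 3.0+). The carry rule below is purely hypothetical:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, 1, 2.0), (1, 2, 3.0), (1, 3, 5.0)],
                           "id long, ts long, value double")

# each group is handed to pandas, where a plain loop can carry state
def carry(pdf):
    pdf = pdf.sort_values("ts")
    acc, out = 0.0, []
    for v in pdf["value"]:
        acc = 0.5 * acc + v  # hypothetical recursive rule
        out.append(acc)
    return pdf.assign(retained=out)

result = df.groupBy("id").applyInPandas(
    carry, schema="id long, ts long, value double, retained double")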

Harlan Nelson · 1,394
0 votes, 1 answer
Spark with Scala
Consider 2 dataframes, a holiday df and an everyday df, with 3 columns as below.
Holiday df (5 records):
Country_code | currency_code | date
Gb | gbp | 2022-04-15
Gb | gbp | 2022-04-16
US | usd |…

Vaibhav Kulkarni · 11