Questions tagged [spark-window-function]

31 questions
5
votes
1 answer

Difference Between Window Function and OrderBy in Spark

I have code whose goal is to take the 10M oldest records out of 1.5B records. I tried to do it with orderBy and it never finished, and then I tried it with a window function and it finished after 15 min. I understood that with orderBy every…
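A hedged sketch of the two approaches, assuming a DataFrame with a ts timestamp column (the source path and column name here are made up for illustration). A global orderBy sorts all 1.5B rows; note that a window with no partitionBy also funnels everything onto a single partition, so a cheap pre-filter helps either variant:

from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("/data/events")  # hypothetical source with a `ts` column

# Approach 1: global sort, then limit -- Spark must order all 1.5B rows.
oldest_sort = df.orderBy("ts").limit(10_000_000)

# Approach 2: rank rows with row_number over a window and filter on the rank.
w = Window.orderBy("ts")
oldest_rank = (df.withColumn("rn", F.row_number().over(w))
                 .filter(F.col("rn") <= 10_000_000)
                 .drop("rn"))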
3
votes
3 answers

How to calculate moving median in DataFrame?

Is there a way to calculate a moving median for an attribute in a Spark DataFrame? I was hoping that it is possible to calculate a moving median using a window function (by defining a window using rowsBetween(0,10)), but there is no functionality to…
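One hedged workaround: since Spark 3.1, percentile_approx is exposed in pyspark.sql.functions, and because it is an aggregate it can be evaluated over a window. Column names ts and value are assumptions:

from pyspark.sql import Window
from pyspark.sql import functions as F

# Spark >= 3.1: percentile_approx is an aggregate, so it works over windows.
w = Window.orderBy("ts").rowsBetween(0, 10)  # current row plus the next 10 rows
df = df.withColumn("moving_median",
                   F.percentile_approx("value", 0.5).over(w))

Note this gives an approximate median; an exact moving median generally means collecting each window into a list and sorting it, which is far more expensive.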
2
votes
1 answer

PySpark group by with rolling window

Suppose I have a table with three columns: dt, id and value. df_tmp = spark.createDataFrame([('2023-01-01', 1001, 5), ('2023-01-15', 1001, 3), ('2023-02-10', 1001, 1), …
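A sketch of the usual pattern for this, assuming a 30-day trailing window (the width is an assumption): rangeBetween needs a numeric ordering column, so the date is cast to unix seconds and the window is expressed in seconds.

from pyspark.sql import Window
from pyspark.sql import functions as F

days = lambda n: n * 86400  # rangeBetween is expressed in the ordering column's units

df = df_tmp.withColumn("dt_sec", F.col("dt").cast("timestamp").cast("long"))

w = (Window.partitionBy("id")
           .orderBy("dt_sec")
           .rangeBetween(-days(30), 0))  # trailing 30 days, current row included

rolled = df.withColumn("rolling_sum", F.sum("value").over(w))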
2
votes
1 answer

Spark - Calculating running sum with a threshold

I have a use-case where I need to compute a running sum over a partition, where the running sum does not exceed a certain threshold. For example:
// Input dataset
| id | created_on | value | running_sum | threshold |
| -- | ---------- | ----- |…
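A plain running sum is a one-liner with sum().over(...), but a sum that is capped by a threshold depends on its own previous value, which built-in window functions cannot express. A common workaround is per-group sequential logic via applyInPandas (Spark 3.0+); the clamping semantics below are an assumption, since the excerpt is cut off:

import pandas as pd

def capped_running_sum(pdf: pd.DataFrame) -> pd.DataFrame:
    pdf = pdf.sort_values("created_on")
    total = 0.0
    sums = []
    for v, cap in zip(pdf["value"], pdf["threshold"]):
        total = min(total + v, cap)  # assumption: the sum is clamped at the threshold
        sums.append(total)
    pdf["running_sum"] = sums
    return pdf

result = df.groupBy("id").applyInPandas(
    capped_running_sum,
    schema="id long, created_on date, value double, running_sum double, threshold double")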
2
votes
1 answer

Compute rolling percentiles in PySpark

I have a dataframe with dates, an ID (let's say of a city) and two columns of temperatures (in my real dataframe I have a dozen columns to compute). I want to "rank" those temperatures for a given window. I want this ranking to be scaled from 0 (the…
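percent_rank() does exactly this scaling: 0.0 for the lowest value in the window, 1.0 for the highest. A sketch, with city_id and the temperature column names assumed:

from pyspark.sql import Window
from pyspark.sql import functions as F

for c in ["temp_a", "temp_b"]:  # extend the list to all dozen columns
    w = Window.partitionBy("city_id").orderBy(c)
    df = df.withColumn(f"{c}_pct", F.percent_rank().over(w))

One caveat: ranking functions always use the whole partition and ignore rowsBetween frames, so a genuinely rolling window means restricting the partition key itself (e.g. adding a year-month column to partitionBy).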
1
vote
0 answers

Spark: How to get the value for each day in an interval?

I have a table with values and a date starting from which each value is valid:
param | validfrom | value
param1 | 01-01-2022 | 1
param2 | 03-01-2022 | 2
param1 | 05-01-2022 | 11
param1 | 07-01-2022 | 1
I need to get values of each parameter on each…
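A hedged sketch of the usual recipe, assuming validfrom is already a DateType column: build a per-day spine with sequence + explode, left-join the known values, and forward-fill with last(..., ignorenulls=True):

from pyspark.sql import Window
from pyspark.sql import functions as F

# One row per day between the earliest and latest validfrom date.
days = (df.agg(F.min("validfrom").alias("lo"), F.max("validfrom").alias("hi"))
          .select(F.explode(F.sequence("lo", "hi")).alias("day")))

spine = df.select("param").distinct().crossJoin(days)

w = (Window.partitionBy("param").orderBy("day")
           .rowsBetween(Window.unboundedPreceding, 0))

filled = (spine.alias("s")
          .join(df.alias("d"),
                (F.col("s.param") == F.col("d.param")) &
                (F.col("s.day") == F.col("d.validfrom")),
                "left")
          .select("s.param", "s.day", "d.value")
          .withColumn("value", F.last("value", ignorenulls=True).over(w)))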
1
vote
2 answers

PySpark - calculate median value with a sliding time window

I have the following data frame in…
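Sketching the same percentile_approx trick with a time-based frame (Spark 3.1+); the one-hour width and the ts/value column names are assumptions:

from pyspark.sql import Window
from pyspark.sql import functions as F

# rangeBetween over unix seconds gives a true sliding *time* window.
w = (Window.orderBy(F.col("ts").cast("long"))
           .rangeBetween(-3600, 0))  # trailing hour, current row included

df = df.withColumn("sliding_median", F.percentile_approx("value", 0.5).over(w))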
1
vote
1 answer

Spark Window Function Null Skew

Recently I encountered an issue running one of our PySpark jobs. While analyzing the stages in the Spark UI, I noticed that the longest-running stage takes 1.2 hours out of the 2.5 hours the entire process takes to…
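When the skew comes from null partition keys (all null-key rows land in a single task), a common mitigation is to window only the non-null keys and union the null rows back with a default. A sketch with assumed column names:

from pyspark.sql import Window
from pyspark.sql import functions as F

w = Window.partitionBy("user_id").orderBy("event_time")

non_null = (df.filter(F.col("user_id").isNotNull())
              .withColumn("rn", F.row_number().over(w)))

# Null keys skip the window entirely instead of piling onto one executor.
nulls = (df.filter(F.col("user_id").isNull())
           .withColumn("rn", F.lit(None).cast("int")))

result = non_null.unionByName(nulls)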
1
vote
2 answers

How to count consecutive days an event happens?

I need to calculate the number of consecutive days, counting backwards from today (2022-01-04), that a client has logged in to my application. I need to use PySpark due to the size of my database. Input: Name Date John 2022-01-01 John …
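The standard gaps-and-islands trick fits here: subtracting the row number (in days) from the date yields a constant key per unbroken run of days. A sketch:

from pyspark.sql import Window
from pyspark.sql import functions as F

w = Window.partitionBy("Name").orderBy("Date")

runs = (df.withColumn("rn", F.row_number().over(w))
          .withColumn("grp", F.expr("date_sub(Date, rn)")))  # constant per streak

streaks = runs.groupBy("Name", "grp").agg(
    F.count("*").alias("consecutive_days"),
    F.max("Date").alias("last_day"))

# The current streak is the one ending today (2022-01-04 in the question).
current = streaks.filter(F.col("last_day") == F.lit("2022-01-04"))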
1
vote
1 answer

How to run user defined function over a window in spark dataframe?

I am trying to detect outliers in my spark dataframe. Below is a data sample:
pressure | Timestamp
358.64 | 2022-01-01 00:00:00
354.98 | 2022-01-01 00:10:00
350.34 | 2022-01-01 00:20:00
429.69 | 2022-01-01 00:30:00
420.41 | 2022-01-01…
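Plain Python UDFs cannot be evaluated over a window, but a pandas aggregate UDF (Series-to-scalar) can, over bounded frames, since Spark 3.0. A sketch of a rolling-median outlier flag; the frame width and the deviation threshold are assumptions:

import pandas as pd
from pyspark.sql import Window
from pyspark.sql import functions as F
from pyspark.sql.functions import pandas_udf

@pandas_udf("double")
def rolling_median(v: pd.Series) -> float:
    return float(v.median())

w = Window.orderBy("Timestamp").rowsBetween(-5, 5)  # 11-row centered frame

df = (df.withColumn("med", rolling_median("pressure").over(w))
        .withColumn("is_outlier",
                    F.abs(F.col("pressure") - F.col("med")) > 50))  # threshold assumed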
1
vote
1 answer

Compare consecutive rows and extract words (excluding subsets) in Spark

I am working on a Spark dataframe. The input dataframe looks like below (Table 1). I need to write logic to get the keywords with maximum length for each session ID. There are multiple keywords that would be part of the output for each session ID.…
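If the rows within a session are ordered so that each keyword extends the previous one (as with typed-ahead search logs; this is an assumption), comparing each row to the next with lead() and dropping substrings keeps only the maximal keywords:

from pyspark.sql import Window
from pyspark.sql import functions as F

w = Window.partitionBy("session_id").orderBy("ts")  # column names assumed

result = (df.withColumn("next_kw", F.lead("keyword").over(w))
            .filter(F.col("next_kw").isNull() |
                    ~F.col("next_kw").contains(F.col("keyword")))
            .drop("next_kw"))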
0
votes
1 answer

Add end-of-month column dynamically to Spark DataFrame

I have a PySpark DataFrame as follows. I need to fill the EOM column for all null values for each id dynamically, based on the last non-null EOM value, and it should be continuous. My output dataframe looks like this. I have tried this logic: df.where("EOM…
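A hedged sketch of the forward-fill-and-advance pattern; the month_seq ordering column is an assumption, since the excerpt does not show one:

from pyspark.sql import Window
from pyspark.sql import functions as F

w_fill = (Window.partitionBy("id").orderBy("month_seq")
                .rowsBetween(Window.unboundedPreceding, 0))

df = (df.withColumn("EOM_filled", F.last("EOM", ignorenulls=True).over(w_fill))
        # Within each fill group, the first row is the original non-null EOM;
        # each following null row advances it by one month to stay continuous.
        .withColumn("rn", F.row_number().over(
            Window.partitionBy("id", "EOM_filled").orderBy("month_seq")))
        .withColumn("EOM", F.expr("add_months(EOM_filled, rn - 1)"))
        .drop("EOM_filled", "rn"))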
0
votes
0 answers

Spark Structured Streaming not ingesting latest records outputMode append

I'm using Spark Structured Streaming to ingest aggregated data using outputMode append; however, the most recent records are not being ingested. I'm ingesting yesterday's records as a stream using Databricks Auto Loader. To write to my final table,…
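For context, this is usually watermark semantics rather than data loss: in append mode, a windowed aggregate row is emitted only once the watermark passes the end of its window, so the newest windows stay pending until later data arrives. A minimal sketch, with the source, sink, and interval sizes assumed:

from pyspark.sql import functions as F

agg = (stream_df  # assumed Auto Loader source
       .withWatermark("event_time", "1 hour")
       .groupBy(F.window("event_time", "15 minutes"), "key")
       .agg(F.count("*").alias("events")))

query = (agg.writeStream
            .outputMode("append")  # only finalized windows are written
            .format("delta")       # assumption: Delta sink on Databricks
            .option("checkpointLocation", "/chk/agg")
            .toTable("final_table"))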
0
votes
0 answers

In PySpark (or SQL), can I use the value calculated in the previous observation in the current observation? (Row-wise calculation, like SAS RETAIN)

I want to step through a table sequentially, using the value calculated in the previous row in the current row. It seems a window function could do this. from pyspark.sql import SparkSession from pyspark.sql import Window import…
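The catch is that lag() can only read input columns, not the column currently being computed, so a true RETAIN-style carry-forward needs sequential per-group logic, e.g. applyInPandas (Spark 3.0+). A sketch with assumed column names and an assumed update rule:

import pandas as pd

def retain(pdf: pd.DataFrame) -> pd.DataFrame:
    pdf = pdf.sort_values("ts")
    prev = 0.0
    out = []
    for v in pdf["value"]:
        prev = prev + v  # assumption: stand-in for any f(previous, current row)
        out.append(prev)
    pdf["retained"] = out
    return pdf

result = df.groupBy("group_id").applyInPandas(
    retain, schema="group_id long, ts timestamp, value double, retained double")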
0
votes
1 answer

Spark with Scala

Consider two dataframes, holiday df and everyday df, with 3 columns as below. Holiday df (5 records):
Country_code | currency_code | date
Gb | gbp | 2022-04-15
Gb | gbp | 2022-04-16
US | usd |…