Questions tagged [spark-window-function]

31 questions
5
votes
1 answer

Difference Between Window Function and OrderBy in Spark

I have code whose goal is to take the 10M oldest records out of 1.5B records. I tried to do it with orderBy and it never finished, and then I tried it with a window function and it finished after 15 min. I understood that with orderBy every…
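A hedged sketch of the two approaches, assuming a DataFrame with a ts timestamp column (the source path and column name here are made up for illustration). A global orderBy sorts all 1.5B rows; note that a window with no partitionBy also funnels everything onto a single partition, so a cheap pre-filter helps either variant:

from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("/data/events")  # hypothetical source with a `ts` column

# Approach 1: global sort, then limit -- Spark must order all 1.5B rows.
oldest_sort = df.orderBy("ts").limit(10_000_000)

# Approach 2: rank rows with row_number over a window and filter on the rank.
w = Window.orderBy("ts")
oldest_rank = (df.withColumn("rn", F.row_number().over(w))
                 .filter(F.col("rn") <= 10_000_000)
                 .drop("rn"))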
3
votes
3 answers

How to calculate moving median in DataFrame?

Is there a way to calculate a moving median for an attribute in a Spark DataFrame? I was hoping that it is possible to calculate a moving median using a window function (by defining a window using rowsBetween(0,10)), but there is no functionality to…
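One hedged workaround: since Spark 3.1, percentile_approx is exposed in pyspark.sql.functions, and because it is an aggregate it can be evaluated over a window. Column names ts and value are assumptions:

from pyspark.sql import Window
from pyspark.sql import functions as F

# Spark >= 3.1: percentile_approx is an aggregate, so it works over windows.
w = Window.orderBy("ts").rowsBetween(0, 10)  # current row plus the next 10 rows
df = df.withColumn("moving_median",
                   F.percentile_approx("value", 0.5).over(w))

Note this gives an approximate median; an exact moving median generally means collecting each window into a list and sorting it, which is far more expensive.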
2
votes
1 answer

PySpark group by with rolling window

Suppose I have a table with three columns: dt, id and value. df_tmp = spark.createDataFrame([('2023-01-01', 1001, 5), ('2023-01-15', 1001, 3), ('2023-02-10', 1001, 1), …
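A sketch of the usual pattern for this, assuming a 30-day trailing window (the width is an assumption): rangeBetween needs a numeric ordering column, so the date is cast to unix seconds and the window is expressed in seconds.

from pyspark.sql import Window
from pyspark.sql import functions as F

days = lambda n: n * 86400  # rangeBetween is expressed in the ordering column's units

df = df_tmp.withColumn("dt_sec", F.col("dt").cast("timestamp").cast("long"))

w = (Window.partitionBy("id")
           .orderBy("dt_sec")
           .rangeBetween(-days(30), 0))  # trailing 30 days, current row included

rolled = df.withColumn("rolling_sum", F.sum("value").over(w))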
2
votes
1 answer

Spark - Calculating running sum with a threshold

I have a use-case where I need to compute a running sum over a partition, where the running sum does not exceed a certain threshold. For example:
// Input dataset
| id | created_on | value | running_sum | threshold |
| -- | ---------- | ----- |…
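A plain running sum is a one-liner with sum().over(...), but a sum that is capped by a threshold depends on its own previous value, which built-in window functions cannot express. A common workaround is per-group sequential logic via applyInPandas (Spark 3.0+); the clamping semantics below are an assumption, since the excerpt is cut off:

import pandas as pd

def capped_running_sum(pdf: pd.DataFrame) -> pd.DataFrame:
    pdf = pdf.sort_values("created_on")
    total = 0.0
    sums = []
    for v, cap in zip(pdf["value"], pdf["threshold"]):
        total = min(total + v, cap)  # assumption: the sum is clamped at the threshold
        sums.append(total)
    pdf["running_sum"] = sums
    return pdf

result = df.groupBy("id").applyInPandas(
    capped_running_sum,
    schema="id long, created_on date, value double, running_sum double, threshold double")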
2
votes
1 answer

Compute rolling percentiles in PySpark

I have a dataframe with dates, an ID (let's say of a city) and two columns of temperatures (in my real dataframe I have a dozen columns to compute). I want to "rank" those temperatures for a given window. I want this ranking to be scaled from 0 (the…
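percent_rank() does exactly this scaling: 0.0 for the lowest value in the window, 1.0 for the highest. A sketch, with city_id and the temperature column names assumed:

from pyspark.sql import Window
from pyspark.sql import functions as F

for c in ["temp_a", "temp_b"]:  # extend the list to all dozen columns
    w = Window.partitionBy("city_id").orderBy(c)
    df = df.withColumn(f"{c}_pct", F.percent_rank().over(w))

One caveat: ranking functions always use the whole partition and ignore rowsBetween frames, so a genuinely rolling window means restricting the partition key itself (e.g. adding a year-month column to partitionBy).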
1
vote
0 answers

Spark: How to get the value for each day in an interval?

I have a table with values and a date starting from which each value is valid:
param | validfrom | value
param1 | 01-01-2022 | 1
param2 | 03-01-2022 | 2
param1 | 05-01-2022 | 11
param1 | 07-01-2022 | 1
I need to get values of each parameter on each…
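A hedged sketch of the usual recipe, assuming validfrom is already a DateType column: build a per-day spine with sequence + explode, left-join the known values, and forward-fill with last(..., ignorenulls=True):

from pyspark.sql import Window
from pyspark.sql import functions as F

# One row per day between the earliest and latest validfrom date.
days = (df.agg(F.min("validfrom").alias("lo"), F.max("validfrom").alias("hi"))
          .select(F.explode(F.sequence("lo", "hi")).alias("day")))

spine = df.select("param").distinct().crossJoin(days)

w = (Window.partitionBy("param").orderBy("day")
           .rowsBetween(Window.unboundedPreceding, 0))

filled = (spine.alias("s")
          .join(df.alias("d"),
                (F.col("s.param") == F.col("d.param")) &
                (F.col("s.day") == F.col("d.validfrom")),
                "left")
          .select("s.param", "s.day", "d.value")
          .withColumn("value", F.last("value", ignorenulls=True).over(w)))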
1
vote
2 answers

PySpark - calculate median value with a sliding time window

I have the following data frame in…
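Sketching the same percentile_approx trick with a time-based frame (Spark 3.1+); the one-hour width and the ts/value column names are assumptions:

from pyspark.sql import Window
from pyspark.sql import functions as F

# rangeBetween over unix seconds gives a true sliding *time* window.
w = (Window.orderBy(F.col("ts").cast("long"))
           .rangeBetween(-3600, 0))  # trailing hour, current row included

df = df.withColumn("sliding_median", F.percentile_approx("value", 0.5).over(w))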
1
vote
1 answer

Spark Window Function Null Skew

Recently I encountered an issue running one of our PySpark jobs. While analyzing the stages in the Spark UI, I noticed that the longest-running stage takes 1.2 hours out of the 2.5 hours the entire process takes to…
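When the skew comes from null partition keys (all null-key rows land in a single task), a common mitigation is to window only the non-null keys and union the null rows back with a default. A sketch with assumed column names:

from pyspark.sql import Window
from pyspark.sql import functions as F

w = Window.partitionBy("user_id").orderBy("event_time")

non_null = (df.filter(F.col("user_id").isNotNull())
              .withColumn("rn", F.row_number().over(w)))

# Null keys skip the window entirely instead of piling onto one executor.
nulls = (df.filter(F.col("user_id").isNull())
           .withColumn("rn", F.lit(None).cast("int")))

result = non_null.unionByName(nulls)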
1
vote
2 answers

How to count consecutive days an event happens?

I need to calculate the number of consecutive days, counting backwards from today (2022-01-04), that a client has logged in to my application. I need to use PySpark due to the size of my database. Input: Name Date John 2022-01-01 John …
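The standard gaps-and-islands trick fits here: subtracting the row number (in days) from the date yields a constant key per unbroken run of days. A sketch:

from pyspark.sql import Window
from pyspark.sql import functions as F

w = Window.partitionBy("Name").orderBy("Date")

runs = (df.withColumn("rn", F.row_number().over(w))
          .withColumn("grp", F.expr("date_sub(Date, rn)")))  # constant per streak

streaks = runs.groupBy("Name", "grp").agg(
    F.count("*").alias("consecutive_days"),
    F.max("Date").alias("last_day"))

# The current streak is the one ending today (2022-01-04 in the question).
current = streaks.filter(F.col("last_day") == F.lit("2022-01-04"))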
1
vote
1 answer

How to run user defined function over a window in spark dataframe?

I am trying to detect outliers in my spark dataframe. Below is a data sample:
pressure | Timestamp
358.64 | 2022-01-01 00:00:00
354.98 | 2022-01-01 00:10:00
350.34 | 2022-01-01 00:20:00
429.69 | 2022-01-01 00:30:00
420.41 | 2022-01-01…
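Plain Python UDFs cannot be evaluated over a window, but a pandas aggregate UDF (Series-to-scalar) can, over bounded frames, since Spark 3.0. A sketch of a rolling-median outlier flag; the frame width and the deviation threshold are assumptions:

import pandas as pd
from pyspark.sql import Window
from pyspark.sql import functions as F
from pyspark.sql.functions import pandas_udf

@pandas_udf("double")
def rolling_median(v: pd.Series) -> float:
    return float(v.median())

w = Window.orderBy("Timestamp").rowsBetween(-5, 5)  # 11-row centered frame

df = (df.withColumn("med", rolling_median("pressure").over(w))
        .withColumn("is_outlier",
                    F.abs(F.col("pressure") - F.col("med")) > 50))  # threshold assumed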
1
vote
1 answer

Compare consecutive rows and extract words (excluding subsets) in Spark

I am working on a Spark dataframe. The input dataframe looks like below (Table 1). I need to write logic to get the keywords with maximum length for each session ID. There are multiple keywords that would be part of the output for each session ID.…
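If the rows within a session are ordered so that each keyword extends the previous one (as with typed-ahead search logs; this is an assumption), comparing each row to the next with lead() and dropping substrings keeps only the maximal keywords:

from pyspark.sql import Window
from pyspark.sql import functions as F

w = Window.partitionBy("session_id").orderBy("ts")  # column names assumed

result = (df.withColumn("next_kw", F.lead("keyword").over(w))
            .filter(F.col("next_kw").isNull() |
                    ~F.col("next_kw").contains(F.col("keyword")))
            .drop("next_kw"))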
0
votes
1 answer

Add end-of-month column dynamically to Spark DataFrame

I have a PySpark DataFrame as follows. I need to fill the EOM column for all null values for each id dynamically, based on the last non-null EOM value, and it should be continuous. My output dataframe looks like this. I have tried this logic: df.where("EOM…
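A hedged sketch of the forward-fill-and-advance pattern; the month_seq ordering column is an assumption, since the excerpt does not show one:

from pyspark.sql import Window
from pyspark.sql import functions as F

w_fill = (Window.partitionBy("id").orderBy("month_seq")
                .rowsBetween(Window.unboundedPreceding, 0))

df = (df.withColumn("EOM_filled", F.last("EOM", ignorenulls=True).over(w_fill))
        # Within each fill group, the first row is the original non-null EOM;
        # each following null row advances it by one month to stay continuous.
        .withColumn("rn", F.row_number().over(
            Window.partitionBy("id", "EOM_filled").orderBy("month_seq")))
        .withColumn("EOM", F.expr("add_months(EOM_filled, rn - 1)"))
        .drop("EOM_filled", "rn"))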
0
votes
0 answers

Spark Structured Streaming not ingesting latest records outputMode append

I'm using Spark Structured Streaming to ingest aggregated data using outputMode append; however, the most recent records are not being ingested. I'm ingesting yesterday's records as a stream using Databricks Auto Loader. To write to my final table,…
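For context, this is usually watermark semantics rather than data loss: in append mode, a windowed aggregate row is emitted only once the watermark passes the end of its window, so the newest windows stay pending until later data arrives. A minimal sketch, with the source, sink, and interval sizes assumed:

from pyspark.sql import functions as F

agg = (stream_df  # assumed Auto Loader source
       .withWatermark("event_time", "1 hour")
       .groupBy(F.window("event_time", "15 minutes"), "key")
       .agg(F.count("*").alias("events")))

query = (agg.writeStream
            .outputMode("append")  # only finalized windows are written
            .format("delta")       # assumption: Delta sink on Databricks
            .option("checkpointLocation", "/chk/agg")
            .toTable("final_table"))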
0
votes
0 answers

In PySpark (or SQL), can I use the value calculated in the previous observation in the current observation? (Row-wise calculation, like SAS RETAIN)

I want to step through a table sequentially, using the value calculated in the previous row in the current row. It seems a window function could do this. from pyspark.sql import SparkSession from pyspark.sql import Window import…
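The catch is that lag() can only read input columns, not the column currently being computed, so a true RETAIN-style carry-forward needs sequential per-group logic, e.g. applyInPandas (Spark 3.0+). A sketch with assumed column names and an assumed update rule:

import pandas as pd

def retain(pdf: pd.DataFrame) -> pd.DataFrame:
    pdf = pdf.sort_values("ts")
    prev = 0.0
    out = []
    for v in pdf["value"]:
        prev = prev + v  # assumption: stand-in for any f(previous, current row)
        out.append(prev)
    pdf["retained"] = out
    return pdf

result = df.groupBy("group_id").applyInPandas(
    retain, schema="group_id long, ts timestamp, value double, retained double")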
0
votes
1 answer

Spark with Scala

Consider two dataframes, holiday df and everyday df, with 3 columns as below. Holiday df (5 records):
Country_code | currency_code | date
Gb | gbp | 2022-04-15
Gb | gbp | 2022-04-16
US | usd |…