This is a Spark port of the question "Read range of files in pySpark".
I have time series data in a data frame that looks like this:
Index  Time  Value_A  Value_B
0      1     A        A
1      2     A        A
2      2     B        A
3      3     A        A
4      5     A        A
I want to drop duplicates in the Value_A and Value_B columns, but only consecutive ones: a row should be dropped only while it repeats the previous row's (Value_A, Value_B) pair, and kept again as soon as the pattern changes. The result for this sample data should be:
Index  Time  Value_A  Value_B
0      1     A        A
2      2     B        A
3      3     A        A