I have this kind of DataFrame in Spark:
+---+--------+-----+
| A| date|value|
+---+--------+-----+
| 1|12/06/15| 0,0|
| 1|17/06/15| 0,0|
| 3|12/06/15| 0,0|
| 3|17/06/15| 0,0|
| 4|12/06/15| 0,0|
| 4|17/06/15| 0,0|
| 1|12/06/15| 0,0|
| 1|17/06/15| 0,0|
| 3|12/06/15| 65,4|
| 3|17/06/15| 40,7|
| 4|12/06/15| 73,1|
| 4|17/06/15| 33,3|
....
+---+--------+-----+
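For reference, here is a minimal sketch that rebuilds this sample (assuming Spark 2.x; the app name and the choice to keep value as a string, because of the comma decimal separators, are my own):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("unique-key-example").getOrCreate()
import spark.implicits._

// Hypothetical reconstruction of the sample data shown above;
// "value" stays a string since the numbers use comma decimals.
val df = Seq(
  (1, "12/06/15", "0,0"),  (1, "17/06/15", "0,0"),
  (3, "12/06/15", "0,0"),  (3, "17/06/15", "0,0"),
  (4, "12/06/15", "0,0"),  (4, "17/06/15", "0,0"),
  (1, "12/06/15", "0,0"),  (1, "17/06/15", "0,0"),
  (3, "12/06/15", "65,4"), (3, "17/06/15", "40,7"),
  (4, "12/06/15", "73,1"), (4, "17/06/15", "33,3")
).toDF("A", "date", "value")
```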
where the A values are periodic: 1 -> 3 -> 4 -> 1 -> 3 -> 4 -> ... What I need is to add another column T so that {T, A, date} forms a unique key for my records:
+---+---+--------+-----+
| T| A| date|value|
+---+---+--------+-----+
| 1| 1|12/06/15| 0,0|
| 1| 1|17/06/15| 0,0|
| 1| 3|12/06/15| 0,0|
| 1| 3|17/06/15| 0,0|
| 1| 4|12/06/15| 0,0|
| 1| 4|17/06/15| 0,0|
| 2| 1|12/06/15| 0,0|
| 2| 1|17/06/15| 0,0|
| 2| 3|12/06/15| 65,4|
| 2| 3|17/06/15| 40,7|
| 2| 4|12/06/15| 73,1|
| 2| 4|17/06/15| 33,3|
........
+---+---+--------+-----+
I saw that the withColumn DataFrame method allows adding a column whose values are computed from other elements of the current row. The problem I'm facing is that the new T column should increment exactly when a row with the same {A, date} pair has already appeared earlier in the original DataFrame, so its value depends on other rows, not just the current one.
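For context, here is a rough sketch of the direction I've been trying (the names withId and result are mine, and I'm not sure monotonically_increasing_id reliably reflects the original row order, which is exactly what this relies on):

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{monotonically_increasing_id, row_number}

// Helper column capturing the current physical row order; this assumes
// that order matches the periodic 1 -> 3 -> 4 cycle shown above.
val withId = df.withColumn("id", monotonically_increasing_id())

// T = how many times this {A, date} pair has occurred up to this row:
// 1 for the first occurrence, 2 for the second repeat, and so on.
val result = withId
  .withColumn("T", row_number().over(
    Window.partitionBy("A", "date").orderBy("id")))
  .drop("id")
  .select("T", "A", "date", "value")
```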
What is the best way to do this in Spark?