
I have a DataFrame like this in Spark:

+---+--------+-----+
|  A|    date|value|
+---+--------+-----+
|  1|12/06/15|  0,0|
|  1|17/06/15|  0,0|
|  3|12/06/15|  0,0|
|  3|17/06/15|  0,0|
|  4|12/06/15|  0,0|
|  4|17/06/15|  0,0|
|  1|12/06/15|  0,0|
|  1|17/06/15|  0,0|
|  3|12/06/15| 65,4|
|  3|17/06/15| 40,7|
|  4|12/06/15| 73,1|
|  4|17/06/15| 33,3|
....
+---+--------+-----+

where the A values are periodic: 1 -> 3 -> 4 -> 1 -> 3 -> 4 -> ...

What I need to do is add another column T so that {T, A, date} is a unique key for my records:

+---+---+--------+-----+
|  T|  A|    date|value|
+---+---+--------+-----+
|  1|  1|12/06/15|  0,0|
|  1|  1|17/06/15|  0,0|
|  1|  3|12/06/15|  0,0|
|  1|  3|17/06/15|  0,0|
|  1|  4|12/06/15|  0,0|
|  1|  4|17/06/15|  0,0|
|  2|  1|12/06/15|  0,0|
|  2|  1|17/06/15|  0,0|
|  2|  3|12/06/15| 65,4|
|  2|  3|17/06/15| 40,7|
|  2|  4|12/06/15| 73,1|
|  2|  4|17/06/15| 33,3|
........
+---+---+--------+-----+

I saw that the withColumn DataFrame method allows adding columns to a DataFrame, and that the new column's values can be computed from other elements of the current row. The problem I'm facing is that T should be incremented if and only if a record with the same {A, date} pair has already occurred earlier in the original DataFrame.
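For what it's worth, the closest I can come up with is something along these lines (a rough sketch, not tested at scale: it assumes A is an Int while date and value are strings, that zipWithIndex reflects the original row order, and the names idx and byOccurrence are just ones I made up):

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.row_number
    import sqlContext.implicits._  // for toDF

    // df is the original DataFrame with columns A, date, value.
    // zipWithIndex attaches the original row position, which is the only
    // thing that distinguishes the repeated {A, date} pairs.
    val indexed = df.rdd.zipWithIndex.map { case (row, idx) =>
      (idx, row.getAs[Int]("A"), row.getAs[String]("date"), row.getAs[String]("value"))
    }.toDF("idx", "A", "date", "value")

    // T = 1 for the first occurrence of a given {A, date}, 2 for the second, ...
    val byOccurrence = Window.partitionBy("A", "date").orderBy("idx")
    val result = indexed
      .withColumn("T", row_number().over(byOccurrence))
      .drop("idx")

But round-tripping through the RDD API just to recover the row order feels clumsy, which is why I'm asking.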

What is the best way to do this in Spark?

  • @zero323 It is contiguous data: the point is that the original dataset is missing the additional id I'm trying to rebuild – matteo rulli Jul 07 '16 at 18:49
  • If that's the case there is no generic and _efficient_ solution with Spark SQL alone. And if you drop SQL then it is just a variant of http://stackoverflow.com/q/35154267/1560062, don't you think? – zero323 Jul 07 '16 at 19:09
