I have a table with 3 columns: a date, an id, and a key. I'm hoping to find an efficient way to count the pairwise transitions between consecutive keys within an id, then combine those totals across all ids. Basically I'm building a list of temporal transitions. For example:
Input:
╔══════════╦════╦═════╗
║   Date   ║ ID ║ Key ║
╠══════════╬════╬═════╣
║ 1/1/2018 ║ A  ║ XY  ║
║ 1/2/2018 ║ A  ║ GT  ║
║ 1/6/2018 ║ A  ║ WE  ║
║ 1/9/2018 ║ A  ║ PO  ║
║ 1/2/2018 ║ B  ║ XY  ║
║ 1/4/2018 ║ B  ║ GT  ║
╚══════════╩════╩═════╝
Output:
╔══════════╦═══════════╦═══════╗
║ FirstKey ║ SecondKey ║ Count ║
╠══════════╬═══════════╬═══════╣
║    XY    ║    GT     ║   2   ║
║    GT    ║    WE     ║   1   ║
║    WE    ║    PO     ║   1   ║
╚══════════╩═══════════╩═══════╝
It'd be trivially simple to order by ID, then Date, and just loop through building the counts, but I was hoping one of you gurus might be able to help me structure it to be more parallelized/efficient.
Basically, once the rows are ordered by date, I am trying to capture the number of transitions in time between keys. So for ID=A, we have XY, then we have GT (so increment XY->GT by 1). Then we have GT followed by WE (so increment GT->WE by 1), and so on.
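To make the intended result concrete, here's a minimal in-memory sketch of the sequential loop described above, using plain Python on sample data matching the tables (the data and variable names are just for illustration):

```python
from collections import Counter
from itertools import groupby

# Sample rows matching the input table: (Date, ID, Key)
rows = [
    ("2018-01-01", "A", "XY"), ("2018-01-02", "A", "GT"),
    ("2018-01-06", "A", "WE"), ("2018-01-09", "A", "PO"),
    ("2018-01-02", "B", "XY"), ("2018-01-04", "B", "GT"),
]

# Sort by (ID, Date), then pair each key with its successor within the same ID.
rows.sort(key=lambda r: (r[1], r[0]))
counts = Counter()
for _, group in groupby(rows, key=lambda r: r[1]):
    keys = [r[2] for r in group]
    counts.update(zip(keys, keys[1:]))

print(counts)
```

This produces `{('XY', 'GT'): 2, ('GT', 'WE'): 1, ('WE', 'PO'): 1}`, i.e. the output table above.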
Working on Spark with Scala/Python.