I have a table of co-occurrence counts stored on S3, where each row is [key-a, key-b, count], and I want to produce the co-occurrence probability matrix from it.
To do that I need to calculate the sum of the counts for each key-a, and then divide each row by the sum for its key-a.
If I were doing this "by hand", I would make one pass over the data to build a hash table from keys to totals (in LevelDB or something like it), and then a second pass to do the division. That doesn't sound like a very Cascalog-y way to do it.
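For concreteness, here is a minimal Python sketch of the two-pass version I have in mind. It is only an illustration: a plain in-memory dict stands in for LevelDB, and the function name and sample rows are made up.

```python
from collections import defaultdict


def cooccurrence_probabilities(rows):
    """Turn (key_a, key_b, count) rows into (key_a, key_b, probability) rows.

    Hypothetical helper for illustration; an in-memory dict stands in for
    the on-disk key -> total table.
    """
    rows = list(rows)  # materialize so the data can be scanned twice

    # Pass 1: sum the counts for each key_a.
    totals = defaultdict(int)
    for key_a, _key_b, count in rows:
        totals[key_a] += count

    # Pass 2: divide each row's count by the total for its key_a.
    return [(key_a, key_b, count / totals[key_a]) for key_a, key_b, count in rows]


# Made-up sample rows in the [key-a, key-b, count] layout.
sample = [("x", "y", 3), ("x", "z", 1), ("y", "x", 2)]
result = cooccurrence_probabilities(sample)
```

The second pass is effectively a join of the rows against the per-key totals on key-a, which is the part I would like to express in Cascalog.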
Is there some way I can get the total for a row by doing the equivalent of a self-join?