My dataframe looks like this:
//+------------+
//|  collection|
//+------------+
//| [9, 17, 24]|
//| [60, 6, 75]|
//|[18, 28, 38]|
//|     [9, 64]|
//+------------+
All rows are sorted and have different lengths. Is there a way in Spark to merge rows that share common elements? The desired result would be:
//+---------------+
//|     collection|
//+---------------+
//|[9, 17, 24, 64]|
//|    [60, 6, 75]|
//|   [18, 28, 38]|
//+---------------+
The only solution I have found, which is very slow (and likely infeasible for a very large DataFrame with 1B+ rows), is to collect all rows as a nested list with:
dat = all_frames.select("collection").rdd.flatMap(lambda x: x).collect()
and then run a serial BFS or DFS over the collected lists on the driver.
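For reference, the driver-side merge step I mean looks roughly like this (a minimal sketch using union-find instead of BFS/DFS; `merge_collections` is a hypothetical helper, and `rows` stands in for the collected `dat`):

```python
def merge_collections(rows):
    """Merge lists that share at least one common element (union-find)."""
    parent = {}

    def find(x):
        # Follow parent pointers to the set's root, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    for row in rows:
        for el in row:
            parent.setdefault(el, el)
        # Link every element of the row to the first one.
        for el in row[1:]:
            union(row[0], el)

    # Group elements by their root.
    groups = {}
    for el in parent:
        groups.setdefault(find(el), set()).add(el)
    return [sorted(g) for g in groups.values()]

rows = [[9, 17, 24], [60, 6, 75], [18, 28, 38], [9, 64]]
merged = merge_collections(rows)
# merged contains [9, 17, 24, 64], [6, 60, 75], [18, 28, 38] (group order may vary)
```

This works on small data, but everything happens serially on the driver after the `collect()`, which is exactly the bottleneck I want to avoid.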