
My dataframe looks like this:

//+------------+
//|collection  |
//+------------+
//|[9, 17, 24] |
//|[60, 6, 75] |
//|[18, 28, 38]|
//|[9, 64]     |
//+------------+

All rows are sorted and have different lengths. Is there a way in Spark to merge rows that share common elements, so that the result looks like this?

//+---------------+
//|collection     |
//+---------------+
//|[9, 17, 24, 64]|
//|[60, 6, 75]    |
//|[18, 28, 38]   |
//+---------------+

The only solution I have, which is very slow (if not impossible for a very large DataFrame with 1B+ rows), is to collect all rows as a nested list with:

dat = all_frames.select("collection").rdd.flatMap(lambda x: x).collect()

and then run a serial BFS or DFS on the driver.

1 Answer


This is a typical connected-components problem. You could use Spark GraphX to solve it: for each row, create an edge for each pair of elements, then build a graph from these edges and compute its connected components. Below is an example using the Scala API.

import org.apache.spark.graphx.GraphLoader

// Load the graph as in the PageRank example
val graph = GraphLoader.edgeListFile(sc, "data/graphx/followers.txt")
// Find the connected components
val cc = graph.connectedComponents().vertices
// Join the connected components with the usernames
val users = sc.textFile("data/graphx/users.txt").map { line =>
  val fields = line.split(",")
  (fields(0).toLong, fields(1))
}
val ccByUsername = users.join(cc).map {
  case (id, (username, cc)) => (username, cc)
}
// Print the result
println(ccByUsername.collect().mkString("\n"))