
My dataframe looks like this:

//+------------+
//|collection  |
//+------------+
//|[9, 17, 24] |
//|[60, 6, 75] |
//|[18, 28, 38]|
//|[9, 64]     |
//+------------+

All rows are sorted and have different lengths. Is there a way in Spark to merge rows that share common elements, so that the result looks like this?

//+---------------+
//|collection     |
//+---------------+
//|[9, 17, 24, 64]|
//|[60, 6, 75]    |
//|[18, 28, 38]   |
//+---------------+

The only solution I have, which is very slow (if not impossible for a very large DataFrame with 1B+ rows), is to collect all rows as a nested list with:

dat = all_frames.select("collection").rdd.flatMap(lambda x: x).collect()

and then run a serial BFS or DFS on the driver.

1 Answer


This is a typical connected-components problem. You could use Spark GraphX to solve it: for each row, create an edge for each pair of elements, then build a graph from these edges and compute its connected components. Below is an example using the Scala API.

import org.apache.spark.graphx.GraphLoader

// Load the graph as in the PageRank example
val graph = GraphLoader.edgeListFile(sc, "data/graphx/followers.txt")
// Find the connected components
val cc = graph.connectedComponents().vertices
// Join the connected components with the usernames
val users = sc.textFile("data/graphx/users.txt").map { line =>
  val fields = line.split(",")
  (fields(0).toLong, fields(1))
}
val ccByUsername = users.join(cc).map {
  case (id, (username, cc)) => (username, cc)
}
// Print the result
println(ccByUsername.collect().mkString("\n"))