Problem Statement

Below are an example and the expected result. Each tree is described by three columns (the tree depth is dynamic), and the parent/child relationships are encoded in those columns. I need to collapse each tree into a single row, keyed by its root, in a PySpark RDD. Any ideas would be appreciated. Thank you.

Example RDD:

(null,a1,null)
(null,a2,a1)
(null,a3,a2)
(null,a4,a3)
(b1,null,a4)

Expected Result

b1->a4->a3->a2->a1, result RDD: (b1,(a4,a3,a2,a1))
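To make the expected result concrete, here is a minimal sketch of the chain-following logic in plain Python. It rests on assumptions that are not stated in the question: column 1 is the key of a tree, column 2 is a node, column 3 is the node that the row links to, and the rows are small enough to be gathered on the driver. The function name `chains_by_key` is hypothetical.

```python
def chains_by_key(rows):
    """Follow the column-3 links from every keyed row and return
    a list of (key, (node, node, ...)) tuples.

    Assumed row layout (not confirmed by the question):
    (key_or_None, node_or_None, linked_node_or_None)
    """
    # Map each node (column 2) to the node it links to (column 3).
    next_of = {node: nxt for _key, node, nxt in rows if node is not None}

    result = []
    for key, _node, nxt in rows:
        if key is None:
            continue  # only rows with a non-null column 1 start a chain
        chain, seen = [], set()
        cur = nxt
        # Walk the links until the chain ends; `seen` guards against cycles.
        while cur is not None and cur not in seen:
            chain.append(cur)
            seen.add(cur)
            cur = next_of.get(cur)
        result.append((key, tuple(chain)))
    return result


rows = [
    (None, "a1", None),
    (None, "a2", "a1"),
    (None, "a3", "a2"),
    (None, "a4", "a3"),
    ("b1", None, "a4"),
]

print(chains_by_key(rows))  # [('b1', ('a4', 'a3', 'a2', 'a1'))]
```

With an actual RDD, one option under the same assumptions would be `sc.parallelize(chains_by_key(rdd.collect()))`, where `sc` is the SparkContext and null values arrive as `None`. This only works when the data fits in driver memory; for larger inputs the links would have to be resolved with repeated joins or a graph library instead.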
Chris
  • Do you have multiple trees in the same RDD? You could look into using GraphX, since it would be more suitable for this type of data (depending on what you finally want to do with it). See: https://stackoverflow.com/questions/23302270/how-do-i-run-graphx-with-python-pyspark – Shaido Sep 07 '18 at 01:47
  • Thanks a lot. I only need the longest tree in the RDD, and the trees are separated by the non-null value in column 1 ('b1'). – Chris Sep 07 '18 at 03:16

0 Answers