How to trace the tree in pyspark RDD?

Asked Sep 07 '18 at 00:51

Active Sep 07 '18 at 08:39

Viewed 116 times

Problem Statement

The example and the expected result are given below. The tree is described by three columns (the tree depth is dynamic), and the relationships between nodes are encoded in those columns. I need to collapse each tree into a single row per key in a PySpark RDD. Any ideas would be appreciated. Thank you.

Example RDD:

(null,a1,null)
(null,a2,a1)
(null,a3,a2)
(null,a4,a3)
(b1,null,a4)

Expected Result

b1->a4->a3->a2->a1, result RDD: (b1,(a4,a3,a2,a1))
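
One possible way to get from the example rows to the expected result is sketched below. It is a minimal sketch, not a confirmed solution, and it rests on several assumptions: `null` is Python `None`; each row is read as (key, node, link), where the link in column 3 points at the next node in the chain; and the edge list is small enough to collect to the driver. The helper names `to_edge` and `trace` are hypothetical. The core logic is plain Python so it runs without a cluster, and the equivalent RDD calls are shown in comments.

```python
def to_edge(row):
    """Turn a (key, node, link) row into a (source, target) pair.

    Assumption: a row with a non-null key starts a chain from that key;
    otherwise the node in column 2 is the source of the link.
    """
    key, node, link = row
    source = key if key is not None else node
    return (source, link)


def trace(edges, start):
    """Follow source -> target links from start until a null link or a cycle."""
    path = []
    seen = {start}
    current = edges.get(start)
    while current is not None and current not in seen:
        path.append(current)
        seen.add(current)
        current = edges.get(current)
    return (start, tuple(path))


# The example rows from the question, with null written as None.
rows = [
    (None, "a1", None),
    (None, "a2", "a1"),
    (None, "a3", "a2"),
    (None, "a4", "a3"),
    ("b1", None, "a4"),
]

# With Spark, assuming the rows live in an RDD named `rdd`, the same
# two lookups could be built with:
#   edges = rdd.map(to_edge).collectAsMap()
#   roots = rdd.filter(lambda r: r[0] is not None).map(lambda r: r[0]).collect()
edges = dict(to_edge(r) for r in rows)
roots = [r[0] for r in rows if r[0] is not None]

results = [trace(edges, root) for root in roots]
print(results)  # [('b1', ('a4', 'a3', 'a2', 'a1'))]
```

If the RDD holds several chains and only the longest one is wanted, `max(results, key=lambda kv: len(kv[1]))` would pick it. For edge lists too large to collect, an iterative join or a graph library would be needed instead.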

edited Sep 07 '18 at 08:39

asked Sep 07 '18 at 00:51

Chris

Do you have multiple trees in the same RDD? You could look into using Graphx since it would be more suitable for this type of data (depending on what you finally want to do with it). See: https://stackoverflow.com/questions/23302270/how-do-i-run-graphx-with-python-pyspark – Shaido Sep 07 '18 at 01:47
Thanks a lot. I only need the longest tree in the RDD, and the trees are separated by a non-null value in column 1, such as 'b1'. – Chris Sep 07 '18 at 03:16


0 Answers