I'm new to Spark and Scala. I am trying to read a batch of Twitter data from a JSON file and turn it into a graph in which each vertex represents a tweet and each edge connects an original tweet to a retweet of it. So far I have managed to read the JSON file and work out the schema of my RDD. I believe I now need to take the data from the SchemaRDD object and create one RDD for the vertices and one RDD for the edges. Is this the right way to approach the problem, or is there an alternative solution? Any help and suggestions would be highly appreciated.
1 Answer
This really depends on your JSON file. You need to parse the data from the JSON file and create your vertices and edges from the parsed data. There isn't a single prescribed way to implement this; it is up to the programmer. One approach is to build an array of vertices and an array of edges (again from the parsed data), parallelize them into a vertex RDD and an edge RDD, and then create the graph you need from those two RDDs. Hope this helps.
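As a rough illustration of that approach, here is a minimal sketch using GraphX. The `Tweet` case class and its fields (`id`, `text`, `retweetOf`) are assumptions made up for the example, since the real field names depend on your JSON file, and the hard-coded `tweets` sequence stands in for the parsed data.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.rdd.RDD

object TweetGraphSketch {
  // Hypothetical record shape; the real field names depend on the JSON file.
  case class Tweet(id: Long, text: String, retweetOf: Option[Long])

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("TweetGraphSketch").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // Stand-in for the data parsed from the JSON file.
    val tweets = Seq(
      Tweet(1L, "original post", None),
      Tweet(2L, "RT original post", Some(1L)),
      Tweet(3L, "RT original post", Some(1L))
    )

    // One vertex per tweet: (vertex id, vertex attribute).
    val vertexArray: Array[(VertexId, String)] =
      tweets.map(t => (t.id, t.text)).toArray

    // One edge per retweet, pointing from the retweet to the original tweet.
    val edgeArray: Array[Edge[String]] =
      tweets.flatMap(t => t.retweetOf.map(orig => Edge(t.id, orig, "retweet"))).toArray

    // Parallelize the local arrays into RDDs and build the graph from them.
    val vertices: RDD[(VertexId, String)] = sc.parallelize(vertexArray)
    val edges: RDD[Edge[String]] = sc.parallelize(edgeArray)
    val graph: Graph[String, String] = Graph(vertices, edges)

    println(s"vertices = ${graph.numVertices}, edges = ${graph.numEdges}")
    sc.stop()
  }
}
```

`Graph(vertices, edges)` only needs an `RDD[(VertexId, VD)]` and an `RDD[Edge[ED]]`, so if the data is already held in a distributed collection, the same tuples could probably be produced by mapping over it directly instead of going through local arrays.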

Al Jenssen
But an array is not an RDD, and it cannot hold big data. Correct me if I am wrong, but I don't think I can create an array of, say, 1 million rows, right? If that is the case, then an array may not work with big data. – Tara Apr 06 '16 at 19:56
Yes, that is correct. Unfortunately, you cannot add a new element to an RDD. One way around this is not to wait until the whole array is filled: parallelize it every n additions, then union the RDD built so far with the new one. – Al Jenssen Apr 09 '16 at 10:55
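A minimal sketch of that batching idea, assuming the parsed records arrive on the driver as an iterator of (retweet id, original id) pairs. The `BatchedEdges` object, the `build` helper, and the `"retweet"` edge label are hypothetical names introduced for the example.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.graphx.Edge
import org.apache.spark.rdd.RDD

object BatchedEdges {
  // Builds an edge RDD from an iterator of (retweetId, originalId) pairs,
  // parallelizing every `batchSize` records and unioning the pieces.
  def build(sc: SparkContext,
            pairs: Iterator[(Long, Long)],
            batchSize: Int): RDD[Edge[String]] = {
    // Each group of at most `batchSize` pairs becomes its own small RDD.
    val batches: List[RDD[Edge[String]]] = pairs
      .grouped(batchSize)
      .map { batch =>
        sc.parallelize(batch.map { case (src, dst) => Edge(src, dst, "retweet") })
      }
      .toList

    // Combine all batch RDDs in one union instead of chaining pairwise unions.
    if (batches.isEmpty) sc.emptyRDD[Edge[String]] else sc.union(batches)
  }
}
```

Only one batch is held in a local collection at a time, but every record still passes through the driver, so this is likely to suit moderately sized inputs better than very large ones.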