
I have huge problems creating a simple graph in Spark GraphX. I really don't understand anything, so I try everything I find, but nothing works. For example, I am trying to reproduce the steps from here.

The following two were OK:

val flightsFromTo = df_1.select($"Origin",$"Dest")

val airportCodes = df_1.select($"Origin", $"Dest").flatMap(x => Iterable(x(0).toString, x(1).toString))

But after this I obtain an error:

val airportVertices: RDD[(VertexId, String)] = airportCodes.distinct().map(x => (MurmurHash.stringHash(x), x))

Error: missing parameter type

Could you please tell me what is wrong?

And by the way, why MurmurHash? What is the purpose of it?

Logic_Problem_42

1 Answer


My guess is that you are working through a 3-year-old tutorial with a recent Spark version. The sqlContext read now returns a Dataset instead of an RDD. If you want it to behave like the tutorial, use .rdd:

import scala.util.hashing.MurmurHash3
val airportVertices: RDD[(VertexId, String)] = airportCodes.rdd.distinct().map(x => (MurmurHash3.stringHash(x), x))

or change the type of the variable:

val airportVertices: Dataset[(Int, String)] = airportCodes.distinct().map(x => (MurmurHash3.stringHash(x), x))

You could also check out https://graphframes.github.io/ if you are interested in graphs and Spark.
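For completeness, a minimal GraphFrames sketch, assuming the graphframes package is on the classpath and reusing df_1 and the column names from the question (vertices need an id column, edges need src and dst):

import org.graphframes.GraphFrame

// Build vertex and edge DataFrames with the column names GraphFrames expects.
val vertices = df_1.select($"Origin".as("id")).union(df_1.select($"Dest".as("id"))).distinct()
val edges    = df_1.select($"Origin".as("src"), $"Dest".as("dst"))

val g = GraphFrame(vertices, edges)
g.inDegrees.show()  // e.g. number of incoming flights per airport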


Updated

To create a Graph you need vertices and edges. To make computation easier, all vertices have to be identified by a VertexId (in essence a Long).
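A minimal sketch of that idea, assuming sc is the SparkContext (the two airports and the single route are made-up example values):

import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.rdd.RDD

// Two hand-picked vertex ids; in the tutorial they come from hashing the airport code.
val vertices: RDD[(VertexId, String)] = sc.parallelize(Seq((1L, "SFO"), (2L, "JFK")))
val edges: RDD[Edge[Int]] = sc.parallelize(Seq(Edge(1L, 2L, 1)))  // edge attribute = flight count

val graph = Graph(vertices, edges)
println(s"${graph.numVertices} vertices, ${graph.numEdges} edges")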

MurmurHash is used to create hashes that are well distributed and very unlikely to collide. More info here: MurmurHash - what is it?

Hashing is a best practice for distributing the data without skew, but there is no technical reason why you couldn't use an incremental counter for each vertex.
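For example, a counter-based sketch using zipWithIndex instead of hashing (reusing airportCodes from the question) could look like this:

import org.apache.spark.graphx.VertexId
import org.apache.spark.rdd.RDD

// Assign an incremental Long id to each distinct airport code instead of hashing it.
val airportVertices: RDD[(VertexId, String)] =
  airportCodes.rdd.distinct().zipWithIndex().map { case (code, id) => (id, code) }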

I've looked at the tutorial, and the only thing you have to change to make it work is to add .rdd:

val flightsFromTo = df_1.select($"Origin",$"Dest").rdd
val airportCodes = df_1.select($"Origin", $"Dest").flatMap(x => Iterable(x(0).toString, x(1).toString)).rdd
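From there the rest of the tutorial should follow. A sketch of the remaining edge construction, based on the snippet from the comments below but using MurmurHash3:

import scala.util.hashing.MurmurHash3
import org.apache.spark.graphx.{Edge, Graph}

// Hash both airport codes of each route, count duplicate routes with reduceByKey,
// and turn every ((origin, destination), count) pair into a single edge.
val flightEdges = flightsFromTo
  .map(x => ((MurmurHash3.stringHash(x(0).toString), MurmurHash3.stringHash(x(1).toString)), 1))
  .reduceByKey(_ + _)
  .map { case ((src, dst), count) => Edge(src.toLong, dst.toLong, count) }

val flightGraph = Graph(airportVertices, flightEdges)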
Tom Lous
  • As accurate as your answer is, GraphX didn't change much since then, and going back to RDDs isn't always a bad idea. GraphFrames are still experimental nevertheless. – eliasah Apr 18 '18 at 15:19
  • Thanks a lot, this worked. But the next command doesn't work: val flightEdges = flightsFromTo.map(x => ((MurmurHash.stringHash(x(0).toString),MurmurHash.stringHash(x(1).toString)), 1)).reduceByKey(_+_).map(x => Edge(x._1._1, x._1._2,x._2)) – Logic_Problem_42 Apr 19 '18 at 09:40
  • The error is "value reduceByKey is not a member of org.apache.spark.sql.Dataset." – Logic_Problem_42 Apr 19 '18 at 09:46
  • You probably should use the `.rdd` method to keep working with RDDs like the rest of the tutorial. – Tom Lous Apr 19 '18 at 09:48
  • What I really don't understand here: does it have to be so complicated? I only want to create a simple graph from a simple table. It should be possible to do this without reduceByKey and similar things. – Logic_Problem_42 Apr 19 '18 at 09:49
  • Thank you, but which .rdd method do you mean? – Logic_Problem_42 Apr 19 '18 at 09:53
  • The reduceByKey is necessary to make sure that duplicate flights are represented only as 1 edge with count attribute. You could just use distinct and ignore the count. Or do nothing at all and get stuck with duplicated edges. – Tom Lous Apr 19 '18 at 11:14
  • Even to convert a 'simple' table into a 'simple' graph, you still have to explain what all the vertices in your table are (they are split over 2 columns, not so simple after all) and then tell the graph to get all *unique* routes from this table into the graph. – Tom Lous Apr 19 '18 at 11:23
  • Thank you for the explanation, it helped a lot. Sorry that I am so dumb, this topic is really difficult to get into for someone without programming experience like me. – Logic_Problem_42 Apr 19 '18 at 11:28
  • Sorry, I have a new question. I see that the graph has VertexId-s now, some strange numbers. Why do we need them? All the matching works with strings as well, at least in "normal" environments such as SQL. Is it somehow more efficient with numbers? – Logic_Problem_42 Apr 19 '18 at 11:31
  • Don't say that you are dumb. It's lack of specific knowledge and that's why this site exists. Just remember that if it looks complicated, there *most of the time* is a good reason for that. Graphs can be very powerful tools, but if you just want to do some aggregation & counting there are easier ways to do that. Good luck in trying to experiment with this subject. – Tom Lous Apr 19 '18 at 11:32
  • The internal graph computation works with VertexIds only (Longs); they could technically just be incremental numbers or so, but never Strings. MurmurHash converts Strings to Ints, preserving uniqueness and adding distribution (see the short sketch after these comments). – Tom Lous Apr 19 '18 at 11:35
  • If you like the answer, can you please accept it / up vote it? – Tom Lous Apr 19 '18 at 12:53
  • Thank you, I like the answer. But my vote doesn't count because I don't have enough reputation yet. – Logic_Problem_42 Apr 19 '18 at 13:06
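To illustrate the VertexId point from the comments above, a tiny sketch (the airport code "SFO" is just an example value):

import scala.util.hashing.MurmurHash3

// The same String always hashes to the same Int, which is widened to the Long-based VertexId GraphX uses.
val sfoId: Long = MurmurHash3.stringHash("SFO")
println(sfoId)  // deterministic (possibly negative) value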