1

I am trying to construct Gremlin queries to use within DSE Graph with geo-searches enabled (indexed in Solr). The problem is the graph is so densely interconnected that the cyclic path traversals time out. Right now the prototype graph I'm working with has ~1600 vertices and ~35K edges. The number of triangles passing through each vertex also summarised:

+--------------------+-----+                                                    
|                 gps|count|
+--------------------+-----+
|POINT (-0.0462032...| 1502|
|POINT (-0.0458048...|  405|
|POINT (-0.0460680...|  488|
|POINT (-0.0478356...| 1176|
|POINT (-0.0479465...| 5566|
|POINT (-0.0481031...| 9896|
|POINT (-0.0484724...|  433|
|POINT (-0.0469379...|  302|
|POINT (-0.0456595...|  394|
|POINT (-0.0450722...|  614|
|POINT (-0.0475904...| 3080|
|POINT (-0.0479464...| 5566|
|POINT (-0.0483400...|  470|
|POINT (-0.0511753...|  370|
|POINT (-0.0521901...| 1746|
|POINT (-0.0519999...| 1026|
|POINT (-0.0468071...| 1247|
|POINT (-0.0469636...| 1165|
|POINT (-0.0463685...|  526|
|POINT (-0.0465805...| 1310|
+--------------------+-----+
only showing top 20 rows

I anticipate the graph growing to a massive size eventually but I will limit the searches for cycles to geographic regions (say of radius ~ 300 meters).

My best attempt so far has been some versions of the following:

g.V().has('gps',Geo.point(lon, lat)).as('P')
.repeat(both()).until(cyclicPath()).path().by('gps')

Script evaluation exceeded the configured threshold of realtime_evaluation_timeout at 180000 ms for the request

For the sake of illustration, the map below shows a starting vertex in green and a terminating vertex in red. Assume that all the vertices are interconnected. I am interested in the longest path between green and red, which would be to circumnavigate the block. enter image description here

A few links I've read through to no avail:

1) http://tinkerpop.apache.org/docs/current/recipes/#cycle-detection

2) Longest acyclic path in a directed unweighted graph

3) https://groups.google.com/forum/#!msg/gremlin-users/tc8zsoEWb5k/9X9LW-7bCgAJ

EDIT

Using Daniel's suggestion below to create a subgraph, it still times out:

gremlin> hood = g.V().hasLabel('image').has('gps', Geo.inside(point(-0.04813968113126384, 51.531259899256995), 100, Unit.METERS)).bothE().subgraph('hood').cap('hood').next()
==>tinkergraph[vertices:640 edges:28078]
gremlin> hg = hood.traversal()
==>graphtraversalsource[tinkergraph[vertices:640 edges:28078], standard]
gremlin> hg.V().has('gps', Geo.point(-0.04813968113126384, 51.531259899256995)).as('x')
==>v[{~label=image, partition_key=2507574903070261248, cluster_key=RFAHA095CLK-2017-09-14 12:52:31.613}]
gremlin> hg.V().has('gps', Geo.point(-0.04813968113126384, 51.531259899256995)).as('x').repeat(both().simplePath()).emit(where(both().as('x'))).both().where(eq('x')).tail(1).path()
Script evaluation exceeded the configured threshold of realtime_evaluation_timeout at 180000 ms for the request: [91b6f1fa-0626-40a3-9466-5d28c7b5c27c - hg.V().has('gps', Geo.point(-0.04813968113126384, 51.531259899256995)).as('x').repeat(both().simplePath()).emit(where(both().as('x'))).both().where(eq('x')).tail(1).path()]
Type ':help' or ':h' for help.
Display stack trace? [yN]n
arjology
  • 51
  • 6

2 Answers2

4

The longest path, based on the number of hops, will be the last one you can find.

g.V().has('gps', Geo.point(x, y)).as('x').
  repeat(both().simplePath()).
    emit(where(both().as('x'))).
  both().where(eq('x')).tail(1).
  path()

There's no way to make this query perform well in OLTP, unless you have a very tiny (sub)graph. So, depending on what you see as a "city block" in your graph, you should probably extract that first as a subgraph and then apply the longest path query (in memory).

Daniel Kuppitz
  • 10,846
  • 1
  • 25
  • 34
  • awesome thanks Daniel. I've seen your post on a similar (if not same) query before; but I think I have no choice but doing it as an OLTP which I'm not sure how to achieve... – arjology Mar 02 '18 at 15:58
  • awesome thanks Daniel. I've seen your post on a similar (if not same) query before; but I think I have no choice but doing it as an OLTP which I'm not sure how to achieve... When using spark, for example: `graph.V.hasLabel("image").has("gps", Geo.inside(Geo.point(-0.047167, 51.532042), 300, Geo.Unit.METERS)).count().show()` throws an exception about task not serializable (`Caused by: java.io.NotSerializableException: com.datastax.shaded.esri.ogc.OGCPoint`) – arjology Mar 02 '18 at 16:03
0

One solution I've come up with involves using Spark GraphFrames and a label propagation algorithm (GraphFrames, LPA). Each community's average GPS location can then be computed (in fact you don't even need the average, simply a single member of each community would suffice) and all the edges that exist between each community member representative (average or otherwise).

Select and save a region of the graph and save the vertices and edges:

g.V().has('gps', Geo.inside(Geo.point(x,y), radius, Unit.METERS))
.subgraph('g').cap(g')

Spark snippet:

import org.graphframes.GraphFrame

val V = spark.read.json("v.json")
val E = spark.read.json("e.json")
val g = GraphFrame(V,E)
val result = g.labelPropagation.maxIter(5).run()

val rdd = result.select("fullgps", "label").map(row => {
    val coords = row.getString(0).split(",")
    val x = coords(0).toDouble
    val y = coords(1).toDouble
    val z = coords(2).toDouble
    val id = row.getLong(1)
    (x,y,z,id)
    }).rdd

// Average GPS:
val newVertexes = rdd.map{ case (x:Double,y:Double,z:Double,id:Long) => (id, (x,y,z)) }.toDF("lbl","gps")
rdd.map{ case (x:Double,y:Double,z:Double,id:Long) => (id, (x,y,z)) }.mapValues(value => (value,1)).reduceByKey{ case (((xL:Double,yL:Double,zL:Double), countL:Int), ((xR:Double,yR:Double,zR:Double), countR:Int)) => ((xR+xL,yR+yL,zR+yL),countR+countL) }.map{ case (id,((x,y,z),c)) => (id, ((x/c,y/c,z/c),c)) }.map{ case (id:Long, ((x:Double, y:Double, z:Double), count:Int)) => Array(x.toString,y.toString,z.toString,id.toString,count.toString) }.map(a => toCsv(a)).saveAsTextFile("avg_gps.csv")

// Keep IDs
val rdd2 = result.select("id", "label").map(row => {
       val id = row.getString(0)
       val lbl = row.getLong(1)
       (lbl, id) }).rdd

val edgeDF = E.select("dst","src").map(row => (row.getString(0),row.getString(1))).toDF("dst","src")

// Src
val tmp0 = result.select("id","label").join(edgeDF, result("id") === edgeDF("src")).withColumnRenamed("lbl","src_lbl")
val srcDF = tmp0.select("src","dst","label").map(row => { (row.getString(0)+"###"+row.getString(1),row.getLong(2)) }).withColumnRenamed("_1","src_lbl").withColumnRenamed("_2","src_edge")

// Dst
val tmp1 = result.select("id","label").join(edgeDF, result("id") === edgeDF("dst")).withColumnRenamed("lbl","dst_lbl")
val dstDF = tmp1.select("src","dst","label").map(row => { (row.getString(0)+"###"+row.getString(1),row.getLong(2)) }).withColumnRenamed("_1","dst_lbl").withColumnRenamed("_2","dst_edge")

val newE = srcDF.join(dstDF, srcDF("src_lbl")===dstDF("dst_lbl"))
val newEdges = newE.filter(newE("src_edge")=!=newE("dst_edge")).select("src_edge","dst_edge").map(row => { (row.getLong(0).toString + "###" + row.getLong(1).toString, row.getLong(0), row.getLong(1)) }).withColumnRenamed("_1","edge").withColumnRenamed("_2","src").withColumnRenamed("_3","dst").dropDuplicates("edge").select("src","dst")

val newGraph = GraphFrames(newVertexes, newEdges)

The averaged locations are then connected by edges and the problem is reduced in this case from ~1600 vertices and ~35K edges to 25 vertices and 54 edges:

enter image description here Here the non-green coloured segments (red, white, black, etc.) represent the individual communities. The green circles are the averaged GPS locations and their sizes are proportional to the number of members in each community. Now it is considerably easier to perform an OLTP algorithm such as proposed by Daniel in the comment above.

arjology
  • 51
  • 6