4

I've been experimenting with Titan over the past few weeks and would like some pointers on the way forward, plus a few specific questions. The purpose of the project is to store log data on a Cassandra cluster (for this question let's use the example of web traffic) and represent relationships in a Titan graph. All nodes are modelled as having an entity value and type (e.g. "google.com","hostname"), and edges have a label (e.g. "connects") as well as several attributes of the relationship (timestamp, flow length and so on).

Once this data is stored in cassandra and represented as a Titan graph, I plan to use d3 code to generate visualisations. At the end of the tunnel I am hoping to be able to build large-scale, interactive, complex graph networks that look something like this: http://goo.gl/CVEd55

My current setup is as follows:

  • A python script to convert log files into vertices.csv and edges.csv files for Gremlin to load in
  • Titan Server 0.4 (using CassandraThrift as the storage backend) - gremlin script to load converted data into Titan
  • Python script that uses NetworkX and opens a RexPro connection, allowing the analyst to enter a custom Gremlin query and outputting the result as JSON
  • Local web front-end that uses the generated JSON and d3 to display the results of the query as a graph
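For readers following along, the first step of this setup could look roughly like the sketch below. It is only an illustration: the whitespace-separated log format (`timestamp source-ip hostname flow-length`), the column names and the numeric ID scheme are all assumptions, not the asker's actual script.

```python
import csv
import io

def convert_log(lines):
    """Turn 'timestamp src_ip hostname flow_length' lines into vertex and edge rows."""
    vertex_ids = {}  # (value, type) -> numeric id, so each entity appears once
    edges = []

    def vertex_id(value, vtype):
        key = (value, vtype)
        if key not in vertex_ids:
            vertex_ids[key] = len(vertex_ids) + 1
        return vertex_ids[key]

    for line in lines:
        parts = line.split()
        if len(parts) != 4:
            continue  # skip malformed lines
        timestamp, src_ip, hostname, flow_length = parts
        out_id = vertex_id(src_ip, "ip")
        in_id = vertex_id(hostname, "hostname")
        # Edge attributes (timestamp, flow length) travel with the edge row.
        edges.append((out_id, in_id, "connects", timestamp, int(flow_length)))

    vertices = sorted((vid, value, vtype) for (value, vtype), vid in vertex_ids.items())
    return vertices, edges

def to_csv(rows, header):
    """Render rows as CSV text; write the result to vertices.csv / edges.csv as needed."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buf.getvalue()

# Hypothetical sample log lines.
sample = [
    "1385633400 10.0.0.5 google.com 5120",
    "1385633460 10.0.0.5 example.org 830",
    "1385633520 10.0.0.7 google.com 12040",
]
vertices, edges = convert_log(sample)
vertices_csv = to_csv(vertices, ["id", "value", "type"])
edges_csv = to_csv(edges, ["out_id", "in_id", "label", "timestamp", "flow_length"])
```

Keeping the (value, type) pair as the deduplication key matches the model described above, where every node has an entity value and a type.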

Ideally as a test base case, I would like the user to be able to type a Gremlin query into the web front-end and be directed to a page containing an interactive d3 graph of the result.
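As a rough sketch of the last hop in that base case, the snippet below reshapes query results into the `nodes`/`links` JSON that d3 force layouts commonly consume, where each link refers to its endpoints by index into the node list. The shape of the input rows (`out`, `in`, `label`, `timestamp`) is an assumption made for illustration; the real shape depends on what the Gremlin query returns.

```python
import json

def to_d3(edge_rows):
    """Build a {"nodes": [...], "links": [...]} structure for a d3 force layout."""
    index = {}  # (value, type) -> position in the nodes list
    nodes, links = [], []

    def node_index(value, vtype):
        key = (value, vtype)
        if key not in index:
            index[key] = len(nodes)
            nodes.append({"name": value, "type": vtype})
        return index[key]

    for row in edge_rows:
        links.append({
            "source": node_index(*row["out"]),
            "target": node_index(*row["in"]),
            "label": row["label"],
            "timestamp": row["timestamp"],
        })
    return {"nodes": nodes, "links": links}

# Hypothetical query results: one dict per edge.
rows = [
    {"out": ("10.0.0.5", "ip"), "in": ("google.com", "hostname"),
     "label": "connects", "timestamp": 1385633400},
    {"out": ("10.0.0.7", "ip"), "in": ("google.com", "hostname"),
     "label": "connects", "timestamp": 1385633520},
]
graph = to_d3(rows)
graph_json = json.dumps(graph, indent=2)
```

The front end can then load `graph_json` directly and bind `nodes` and `links` to the layout.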

My specific questions are as follows:

  1. What is the process for assigning attributes to edges? I have had trouble finding sample code that helps me represent the graph using the model listed above.

  2. My gremlin script to load data into Titan uses bg.commit() to create a batch graph which is later referenced in the RexPro connection conn = RexProConnection('localhost', 8184, 'bg'). This was working originally but after changing my load script, clearing the graph in Gremlin and then reloading, the RexPro connection cannot be opened due to the graph bg apparently not existing. What is the process of updating graphs in Titan? Presumably running a load script twice using the same graph will only add nodes/vertices to the existing one, so how would I go about generating a new graph with the same name every time I update my model, and have RexPro be able to reference it when running a query?

  3. How easy would it be to extend the interface to allow an analyst to enter SQL queries into the front end, using RexPro to access the graph in a similar way to the one described?

Apologies for the long post, but if anyone could share their expertise that would be much appreciated!

GenericJon
adaml288

2 Answers

1

For the d3 visualization, you can use a force-directed graph layout. There are a few variations:

Relationship Graph https://vida.io/documents/qZ5SJdRJfj3XmSXYJ

Force Layout Tree https://vida.io/documents/sy7vzWW7BJEvKdZeL

If your network contains a large number of nodes and edges, you'll need to cluster the data before visualizing it. You can use tools like Gephi or NodeXL to perform the clustering, then use the clustered data to build the force-directed visualization.

Phuoc Do
0

What is the process for assigning attributes to edges?

The process is the same as adding properties to vertices. Get an Edge instance then do:

Edge e = g.addEdge(v1,v2,'label')
e.setProperty('weight',0.1d)
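Since the question's client is a Python script, here is a hedged sketch of how the same two calls might be sent with several properties at once. It only builds a Gremlin script and a parameter map; it has not been run against Titan. The helper name `add_edge_request`, the binding names (`out_id`, `in_id`, `props`) and the use of `g.v(id)` to look up the endpoints are assumptions for illustration.

```python
# Gremlin (Groovy) script: add the edge, then set each property from a bound map.
ADD_EDGE_SCRIPT = """
e = g.addEdge(g.v(out_id), g.v(in_id), label)
props.each { k, v -> e.setProperty(k, v) }
e
""".strip()

def add_edge_request(out_id, in_id, label, **props):
    """Return (script, params) for adding one edge with arbitrary properties."""
    # Blueprints reserves 'id' and 'label' as property keys on edges.
    reserved = {"id", "label"}.intersection(props)
    if reserved:
        raise ValueError("reserved property keys: %s" % sorted(reserved))
    params = {"out_id": out_id, "in_id": in_id, "label": label, "props": props}
    return ADD_EDGE_SCRIPT, params

script, params = add_edge_request(
    1, 2, "connects", timestamp=1385633400, flow_length=5120
)
```

The pair would then presumably be handed to the RexPro client's execute call as script plus bindings, which avoids pasting values into the script text.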

As for:

What is the process of updating graphs in Titan? Presumably running a load script twice using the same graph will only add nodes/vertices to the existing one, so how would I go about generating a new graph with the same name every time I update my model, and have RexPro be able to reference it when running a query?

You don't want a reference to a BatchGraph after loading, as it comes with limitations that will prevent you from querying. It sounds like you should just configure "yourgraph" in rexster.xml. When you load through your script, simply wrap your rexster.xml-configured Graph in a BatchGraph in your code and perform your load operations against it. When you want to query it, simply reference "yourgraph" instead of "bg".

conn = RexProConnection('localhost', 8184, 'yourgraph')

How easy would it be to extend the interface to allow an analyst to enter SQL queries into the front end, using RexPro to access the graph in a similar way to the one described?

It's hard to say if that's "easy" as that depends on factors outside of just the technology. I'll say that it's possible to build an interface that accepts Gremlin queries (you wrote SQL, but I assume you meant Gremlin), passes them to Rexster and gets back an answer. What you do with that answer is up to you, but as far as Rexster's part in it goes, I don't see why that would be a problem.

stephen mallette
  • hi Stephen, thanks for the reply! The properties on edges explanation is really useful. As for the batch graphs - I currently have some code in a Gremlin load script that creates a graph g, followed by a batch graph using g. All the data loading is done using the bg functions but in my Python NetworkX visualisation script, the queries are done using g. This seems to work fine. Am I right in saying that once you create a batch graph from a normal one, that updating the references to g also updates bg? – adaml288 Nov 28 '13 at 10:10
  • Continuing from the last comment as I ran out of lines: Apologies if my third question confused you slightly. What I meant was that I already have an interface that lets the user input a Gremlin query on the web front end in order to generate a graph based on the query. I've done this by making a RexPro connection and passing the query through. What I'm in fact looking for is something that allows analysts who may not be familiar with Gremlin to be able to write simple queries to return data in a similar way. So either CQL, SPARQL or similar. Cheers again. – adaml288 Nov 28 '13 at 10:13
  • If you create `g` without defining it in `rexster.xml` it won't be available outside of that RexPro session (I assume you are using sessionful communication over RexPro). If that's ok, then yes, you should be able to re-use the `g` reference (don't use `bg` for queries as it won't work properly). – stephen mallette Nov 28 '13 at 11:24
  • It's possible to pass SPARQL I guess, but you'd need to write some kind of custom function that processed it. In other words, you'd have to write a function that Rexster knew about like `executeSparql`, then when a user entered a SPARQL query you'd have to wrap that query inside of that function. I think that would work, but have never tried anything like that. – stephen mallette Nov 28 '13 at 11:27
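A minimal sketch of the wrapping step described in that last comment, on the Python side only. `executeSparql` is the hypothetical server-side function named in the comment; it does not exist in Rexster and would have to be written and registered separately. The helper name `wrap_sparql` and the binding name `sparql` are likewise assumptions.

```python
def wrap_sparql(query):
    """Wrap a user-entered SPARQL query as a call to a custom server-side function."""
    query = query.strip()
    if not query:
        raise ValueError("empty SPARQL query")
    # Pass the query as a bound parameter rather than splicing it into the
    # script text, so quotes in the user's input cannot break the script.
    return "executeSparql(sparql)", {"sparql": query}

script, params = wrap_sparql("  SELECT ?s WHERE { ?s ?p ?o }  ")
```

The front end would send this script and its bindings over the existing RexPro connection in place of the raw Gremlin query.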