I need to load a subset of the DBpedia graph into igraph in order to compute some graph statistics (such as node centrality, ...). I load the DBpedia triples using the Redland librdf Python library. Each node is associated with a URI (unique identifier).
I am having some trouble loading the graph into igraph. This is what I do:
1) Read a triple line (subject, predicate, object)
2) Use the following algorithm to get or create a vertex (with attribute)
    def add_or_find_vertex(self, g, uri):
        try:
            return g.vs.find(name=uri)
        except (KeyError, ValueError):
            g.add_vertex(name=uri)
            return g.vs.find(name=uri)
    subjVertex = self.add_or_find_vertex(self.g, subject)
    objVertex = self.add_or_find_vertex(self.g, object)
    self.g.add_edge(subjVertex, objVertex, uri=predicate)
The problem is that my script is very slow, and I need to load 25M triples. Each node is unique but appears several times in the triple file, so I need to perform a lookup before creating each edge. Can you tell me whether the "find" method uses an index for lookups (hash table, ...)? What is the complexity of a vertex lookup? How would you do it?
Thank you very much.