
I have loaded a large RDF dataset (the GeoNames dataset, ~18 GB) into PostgreSQL tables using the rdflib_sqlalchemy SQLAlchemy store.

I ran the following simple query from a Python script using RDFLib. It took more than two hours to return a result. Is there any way to make it faster without loading the RDF data into a triplestore (e.g., Virtuoso)?

from rdflib import Graph
from rdflib_sqlalchemy import store

mystore = store.SQLAlchemy(configuration="postgresql://localhost:5873/postgres")
g = Graph(mystore, identifier="test")
results = g.query("""SELECT ?s ?p ?o WHERE { ?s ?p ?o . } LIMIT 1""")
for row in results:
    print(row)

I am working on a cluster compute node. I have also tried executing the query against in-memory data, as shown below, but it is still slow.

from rdflib import Graph

g = Graph()
g.parse('geonames.nt', format='nt')
results = g.query("""SELECT ?s ?p ?o WHERE { ?s ?p ?o . } LIMIT 1""")
for row in results:
    print(row)

Please let me know your opinion. Thank you for your help.

Beautiful Mind
1 Answer


Profile your code; very likely what is slow is loading all of that data, since the query itself is very simple and has LIMIT 1.
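
For instance, a rough way to check that is to time the two phases separately (nothing GeoNames-specific here, the file name is just the one from your question; running the whole script under python -m cProfile -s cumulative yourscript.py gives a finer breakdown):

# Time loading and querying separately to see which one dominates.
import time
from rdflib import Graph

g = Graph()

t0 = time.time()
g.parse('geonames.nt', format='nt')   # loading phase
print("parse took %.1f s" % (time.time() - t0))

t0 = time.time()
results = g.query("SELECT ?s ?p ?o WHERE { ?s ?p ?o . } LIMIT 1")
print("query took %.1f s" % (time.time() - t0))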

Usually, data sets of that size are managed with a proper triple store, where the data is persisted and typically indexed as well, which speeds up queries.
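
To give an idea, querying such a store from Python looks roughly like this (a sketch only, assuming a Virtuoso instance with its default SPARQL endpoint at http://localhost:8890/sparql and the SPARQLWrapper package installed):

from SPARQLWrapper import SPARQLWrapper, JSON

# The data stays persisted and indexed inside the store, so there is
# no per-run loading cost; only the query goes over the endpoint.
sparql = SPARQLWrapper("http://localhost:8890/sparql")
sparql.setQuery("SELECT ?s ?p ?o WHERE { ?s ?p ?o . } LIMIT 1")
sparql.setReturnFormat(JSON)

for binding in sparql.query().convert()["results"]["bindings"]:
    print(binding["s"]["value"], binding["p"]["value"], binding["o"]["value"])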

Moreover, systems like Virtuoso support parallel loading. Another approach is to split the initial data file (how exactly depends on what the data represent) and load two or more subsets into multiple triple stores; that could be done even if you decide to keep loading in memory.
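
A minimal sketch of the splitting step (N-Triples is line-based, so cutting on line boundaries keeps every chunk a valid file; the chunk size and output names are made up):

from itertools import islice

# Write geonames.nt out as chunks of 10 million triples each; every chunk
# is itself valid N-Triples and can be loaded by an independent process
# into its own store or graph.
CHUNK = 10_000_000

with open('geonames.nt') as src:
    part = 0
    while True:
        lines = list(islice(src, CHUNK))
        if not lines:
            break
        with open('geonames_part%02d.nt' % part, 'w') as dst:
            dst.writelines(lines)
        part += 1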

Multiple graphs in the same triple store might help too.
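
With rdflib (using its default in-memory store here) that would look roughly like the following; the graph IRIs and file names are illustrative, matching the split sketch above:

from rdflib import ConjunctiveGraph, URIRef

# One store, several named graphs: each chunk lands in its own context,
# and a query on the ConjunctiveGraph runs over the union of them.
cg = ConjunctiveGraph()
for i in range(2):
    ctx = cg.get_context(URIRef("http://example.org/geonames/part%d" % i))
    ctx.parse("geonames_part%02d.nt" % i, format="nt")

results = cg.query("SELECT ?s ?p ?o WHERE { ?s ?p ?o . } LIMIT 1")
for row in results:
    print(row)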

zakmck