
I have loaded a large RDF dataset (the GeoNames dataset, ~18 GB) into PostgreSQL tables using the rdflib_sqlalchemy SQLAlchemy store.

I ran the following simple query from a Python script using RDFLib. It took more than two hours to return a result. Is there any way to make it faster without loading the RDF data into a triplestore (e.g., Virtuoso)?

from rdflib import Graph
from rdflib_sqlalchemy import store

mystore = store.SQLAlchemy(configuration="postgresql://localhost:5873/postgres")
g = Graph(mystore, identifier="test")
results = g.query("""SELECT ?s ?p ?o WHERE { ?s ?p ?o . } LIMIT 1""")
for row in results:
    print(row)

I am working on a cluster compute node. I have also tried executing the query against in-memory data, as shown below, but it is still slow.

from rdflib import Graph

g = Graph()
g.parse('geonames.nt', format='nt')
results = g.query("""SELECT ?s ?p ?o WHERE { ?s ?p ?o . } LIMIT 1""")
for row in results:
    print(row)

Please let me know your opinion. Thank you for your help.

Beautiful Mind
1 Answer


Profile your code; very likely what is slow is loading all of that data, since the query itself is very simple and has LIMIT 1.
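
For instance, a rough way to check that is to time the two phases separately (nothing GeoNames-specific here, the file name is just the one from your question; running the whole script under python -m cProfile -s cumulative yourscript.py gives a finer breakdown):

# Time loading and querying separately to see which one dominates.
import time
from rdflib import Graph

g = Graph()

t0 = time.time()
g.parse('geonames.nt', format='nt')   # loading phase
print("parse took %.1f s" % (time.time() - t0))

t0 = time.time()
results = g.query("SELECT ?s ?p ?o WHERE { ?s ?p ?o . } LIMIT 1")
print("query took %.1f s" % (time.time() - t0))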

Usually, data sets of that size are managed with a proper triple store, where the data is persisted and typically indexed as well, which speeds up queries.
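
To give an idea, querying such a store from Python looks roughly like this (a sketch only, assuming a Virtuoso instance with its default SPARQL endpoint at http://localhost:8890/sparql and the SPARQLWrapper package installed):

from SPARQLWrapper import SPARQLWrapper, JSON

# The data stays persisted and indexed inside the store, so there is
# no per-run loading cost; only the query goes over the endpoint.
sparql = SPARQLWrapper("http://localhost:8890/sparql")
sparql.setQuery("SELECT ?s ?p ?o WHERE { ?s ?p ?o . } LIMIT 1")
sparql.setReturnFormat(JSON)

for binding in sparql.query().convert()["results"]["bindings"]:
    print(binding["s"]["value"], binding["p"]["value"], binding["o"]["value"])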

Moreover, systems like Virtuoso support parallel loading. Another approach is to split the initial data file (how exactly depends on what the data represent) and load two or more subsets into multiple triple stores; that could be done even if you decide to keep loading in memory.
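
A minimal sketch of the splitting step (N-Triples is line-based, so cutting on line boundaries keeps every chunk a valid file; the chunk size and output names are made up):

from itertools import islice

# Write geonames.nt out as chunks of 10 million triples each; every chunk
# is itself valid N-Triples and can be loaded by an independent process
# into its own store or graph.
CHUNK = 10_000_000

with open('geonames.nt') as src:
    part = 0
    while True:
        lines = list(islice(src, CHUNK))
        if not lines:
            break
        with open('geonames_part%02d.nt' % part, 'w') as dst:
            dst.writelines(lines)
        part += 1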

Multiple graphs in the same triple store might help too.
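
With rdflib (using its default in-memory store here) that would look roughly like the following; the graph IRIs and file names are illustrative, matching the split sketch above:

from rdflib import ConjunctiveGraph, URIRef

# One store, several named graphs: each chunk lands in its own context,
# and a query on the ConjunctiveGraph runs over the union of them.
cg = ConjunctiveGraph()
for i in range(2):
    ctx = cg.get_context(URIRef("http://example.org/geonames/part%d" % i))
    ctx.parse("geonames_part%02d.nt" % i, format="nt")

results = cg.query("SELECT ?s ?p ?o WHERE { ?s ?p ?o . } LIMIT 1")
for row in results:
    print(row)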

zakmck