
I have a large rdf file:

  • size: 470MB
  • number of lines: almost 6 million
  • unique triple subjects: about 650,000
  • triple amount: about 4,200,000

I loaded the rdf definition into the berkeley db backend of rdflib via:

import rdflib

graph = rdflib.Graph("Sleepycat")
graph.open("store", create=True)
graph.parse("authorities-geografikum_lds.rdf")

It took many hours to complete on my notebook. The computer isn't really powerful (Intel B980 CPU, 4GB of RAM, no SSD) and the definition is large - but still, many hours for this task seems rather long. Maybe it is partly due to indexing / optimizing the data structures?

What is really irritating is the time it takes for the following queries to complete:

SELECT (COUNT(DISTINCT ?s) as ?c)
WHERE {
    ?s ?p ?o
}

(Result: 667,445)

took over 20 minutes and

SELECT (COUNT(?s) as ?c)
WHERE {
    ?s ?p ?o
}

(Result: 4,197,399)

took over 25 minutes.

In my experience, a relational DBMS filled with comparable data would finish a corresponding query in a small fraction of the time, given appropriate indexing.

So my questions are:

Why is rdflib so slow (especially for queries)?

Can I tune / optimize the database, like I can with indexes in a RDBMS?

Is another (free and "compact") triple store better suited for data of this size, performance-wise?

  • the question would be, why using `rdflib` on top of a relational database instead of a "proper" triple store? There are some open source, e.g. Apache Jena Fuseki, Virtuoso, etc. – UninformedUser Jun 12 '19 at 16:36
  • regarding your question, I doubt any index is used when the query takes 20min to complete. But that's something the devs can answer better – UninformedUser Jun 12 '19 at 16:38
  • I looked into the implementation, and I think your query is horrible for it. I mean, it's not a store which does SPARQL to SQL rewriting but implements an iterator model + some indices in the DB. So it has to get all triples and then do the count in-memory. But sure, it still looks a bit slow. – UninformedUser Jun 12 '19 at 16:55
  • Here is some related issue: https://github.com/RDFLib/rdflib/issues/787 – UninformedUser Jun 12 '19 at 16:55
  • Thank you for your answers. My resulting question is: Why use rdflib at all with berkeley db, if a main use case of rdflib is storing and querying triples, and rdflib with berkeley db is obviously not suited for it? – Johann Gottfried Jun 13 '19 at 10:24
  • In the early days of RDF there were no native RDF stores yet. The first RDF stores were built on top of existing storage engines, such as SQL databases and BDB. The rdflib implementation goes back to these early days. This is now an obsolete approach, as native stores offer much better performance and full SPARQL compliance. (Virtuoso is an interesting outlier here; AIUI its RDF store today is still a highly tuned relational engine, and actually has great performance.) – cygri Jun 13 '19 at 12:37
  • So, it makes sense to always use an external RDF store, right? – Johann Gottfried Jun 14 '19 at 08:18
  • And would you still recommend rdflib to fill and query an external store (when using python)? Or are there better alternatives (with or without python)? – Johann Gottfried Jun 14 '19 at 08:20
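The iterator model mentioned in the comments can be illustrated without rdflib at all. The miniature store below is hypothetical (it is not rdflib's actual code), but it shows why a COUNT query is expensive in this design: with no aggregate-aware index, every matching triple must be pulled through Python before the count is computed.

```python
# Hypothetical miniature triple store using an iterator model.
# COUNT has no shortcut here: every triple flows through Python.
triples = [
    ("s1", "p", "o1"),
    ("s1", "p", "o2"),
    ("s2", "p", "o1"),
]

def match(pattern):
    """Yield triples matching an (s, p, o) pattern; None is a wildcard."""
    for t in triples:
        if all(p is None or p == v for p, v in zip(pattern, t)):
            yield t

# SELECT (COUNT(?s) AS ?c) WHERE { ?s ?p ?o }  -> full scan
total = sum(1 for _ in match((None, None, None)))

# SELECT (COUNT(DISTINCT ?s) AS ?c)  -> full scan plus a set in memory
distinct_subjects = len({s for s, _, _ in match((None, None, None))})

print(total, distinct_subjects)  # 3 2
```

With 4.2 million triples instead of three, the same pattern means millions of Python-level iterations and B-tree lookups per query, which is consistent with the multi-minute timings above.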

1 Answer


I experienced similarly slow behavior with RDFLib. For me, a possible solution lay in changing the underlying graph storage to oxrdflib, which improved SPARQL query speed drastically.

see: https://pypi.org/project/oxrdflib/