
I have a large rdf file:

  • size: 470MB
  • number of lines: almost 6 million
  • unique triple subjects: about 650,000
  • triple amount: about 4,200,000

I loaded the rdf definition into the berkeley db backend of rdflib via:

import rdflib

graph = rdflib.Graph("Sleepycat")
graph.open("store", create=True)
graph.parse("authorities-geografikum_lds.rdf")

It took many hours to complete on my notebook. The computer isn't really powerful (Intel B980 CPU, 4GB of RAM, no SSD) and the definition is large - but still, many hours for this task seems rather long. Maybe it is partly due to indexing / optimizing the data structures?

What is really irritating is the time it takes for the following queries to complete:

SELECT (COUNT(DISTINCT ?s) as ?c)
WHERE {
    ?s ?p ?o
}

(Result: 667,445)

took over 20 minutes and

SELECT (COUNT(?s) as ?c)
WHERE {
    ?s ?p ?o
}

(Result: 4,197,399)

took over 25 minutes.

In my experience, a relational DBMS filled with comparable data would finish a corresponding query in a small fraction of the time, given appropriate indexing.

So my questions are:

Why is rdflib so slow (especially for queries)?

Can I tune / optimize the database, like I can with indexes in a RDBMS?

Is another (free and "compact") triple store better suited for data of this size, performance-wise?

  • the question would be, why using `rdflib` on top of a relational database instead of a "proper" triple store? There are some open source, e.g. Apache Jena Fuseki, Virtuoso, etc. – UninformedUser Jun 12 '19 at 16:36
  • regarding your question, I doubt any index is used when the query takes 20min to complete. But that's something the devs can answer better – UninformedUser Jun 12 '19 at 16:38
  • I looked into the implementation, and I think your query is horrible for it. I mean, it's not a store which does SPARQL to SQL rewriting but implements an iterator model + some indices in the DB. So it has to get all triples and then do the count in-memory. But sure, it still looks a bit slow. – UninformedUser Jun 12 '19 at 16:55
  • Here is some related issue: https://github.com/RDFLib/rdflib/issues/787 – UninformedUser Jun 12 '19 at 16:55
  • Thank you for your answers. My resulting question is: Why use rdflib at all with berkeley db, if a main use case of rdflib is storing and querying triples, and rdflib with berkeley db is obviously not suited for it? – Johann Gottfried Jun 13 '19 at 10:24
  • In the early days of RDF there were no native RDF stores yet. The first RDF stores were built on top of existing storage engines, such as SQL databases and BDB. The rdflib implementation goes back to these early days. This is now an obsolete approach, as native stores offer much better performance and full SPARQL compliance. (Virtuoso is an interesting outlier here; AIUI its RDF store today is still a highly tuned relational engine, and actually has great performance.) – cygri Jun 13 '19 at 12:37
  • So, it makes sense to always use an external RDF store, right? – Johann Gottfried Jun 14 '19 at 08:18
  • And would you still recommend rdflib to fill and query an external store (when using python)? Or are there better alternatives (with or without python)? – Johann Gottfried Jun 14 '19 at 08:20
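The iterator model mentioned in the comments can be illustrated without rdflib at all. The miniature store below is hypothetical (it is not rdflib's actual code), but it shows why a COUNT query is expensive in this design: with no aggregate-aware index, every matching triple must be pulled through Python before the count is computed.

```python
# Hypothetical miniature triple store using an iterator model.
# COUNT has no shortcut here: every triple flows through Python.
triples = [
    ("s1", "p", "o1"),
    ("s1", "p", "o2"),
    ("s2", "p", "o1"),
]

def match(pattern):
    """Yield triples matching an (s, p, o) pattern; None is a wildcard."""
    for t in triples:
        if all(p is None or p == v for p, v in zip(pattern, t)):
            yield t

# SELECT (COUNT(?s) AS ?c) WHERE { ?s ?p ?o }  -> full scan
total = sum(1 for _ in match((None, None, None)))

# SELECT (COUNT(DISTINCT ?s) AS ?c)  -> full scan plus a set in memory
distinct_subjects = len({s for s, _, _ in match((None, None, None))})

print(total, distinct_subjects)  # 3 2
```

With 4.2 million triples instead of three, the same pattern means millions of Python-level iterations and B-tree lookups per query, which is consistent with the multi-minute timings above.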

1 Answer


I experienced similarly slow behavior with RDFLib. For me, a possible solution lay in changing the underlying graph storage to oxrdflib, which improved SPARQL query speed drastically.

see: https://pypi.org/project/oxrdflib/