Querying against a Wikipedia RDF file (Turtle format) with Apache Jena

Question

I'm looking for a way to query an RDF file written in Turtle syntax. The file contains the whole Wikipedia category hierarchy, provided by Wikidata.

Here is an extract from the file enwiki-categories.ttl, showing the overall structure of the data:

@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix mediawiki: <https://www.mediawiki.org/ontology#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix schema: <http://schema.org/> .
@prefix cc: <http://creativecommons.org/ns#> .

<https://en.wikipedia.org/wiki/Category:1148_establishments_in_France> a mediawiki:Category ;
    rdfs:label "1148 establishments in France" ;
    mediawiki:pages "2"^^xsd:integer ;
    mediawiki:subcategories "0"^^xsd:integer .

<https://en.wikipedia.org/wiki/Category:1148_establishments_in_France> mediawiki:isInCategory <https://en.wikipedia.org/wiki/Category:1140s_establishments_in_France>,
        <https://en.wikipedia.org/wiki/Category:1148_establishments_by_country>,
        <https://en.wikipedia.org/wiki/Category:1148_establishments_in_Europe>,
        <https://en.wikipedia.org/wiki/Category:1148_in_France>,
        <https://en.wikipedia.org/wiki/Category:Establishments_in_France_by_year> .

My final goal is to retrieve all parent categories of a Wikipedia category by querying the Turtle file. Here is a very short Java example showing my issue:

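import org.apache.jena.atlas.logging.LogCtl;
import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;
LogCtl.setCmdLogging();                          // set up Jena's command-line logging
Model model = ModelFactory.createDefaultModel(); // plain in-memory model
model.read("enwiki-categories.ttl");             // parses the whole file into memory; the out of memory error happens here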

The Turtle file is well over 850 MB, and loading it into a model with the code above causes an out of memory error. I need a way to query the file without loading the full RDF dataset into memory.

--

Is there a way to do this using Apache Jena or another library?

If not, is there a faster way to retrieve all parent categories of a given Wikipedia category using local data?

Why do you need to store Wikidata locally when there is an efficient SPARQL endpoint? URL of the endpoint: https://query.wikidata.org/sparql?query={SPARQL} – Gilles-Antoine Nys, May 16 '18 at 12:21
That's a good question: I have built a parser that successfully extracts the terms from all 5 million Wikipedia articles, but the resulting dataset is too big. I am now looking for a way to filter the retrieved data by parent category. For example, if I select the parent category "Science" and the parser finds the category "Mammal taxonomy" in an article, it should be able to climb back up the hierarchy to the root "Science" and deduce that the article has to be selected. Using the API would add 200 ms of latency to each article read, so I can't do it this way. – Jean-Pierre Coffe, May 16 '18 at 12:36
@Gilles-AntoineNys If you have any idea how to do this better, that would help me a ton. I'm a bit stuck right now! – Jean-Pierre Coffe, May 16 '18 at 12:48
What you intend to do is called "Broader Concept". For instance, the broader concept of a Tree is Plant, and Tree is the broader concept of Pine or Oak. It is formalised in SKOS (skos:broader). – Gilles-Antoine Nys, May 16 '18 at 13:05
The problem is that I don't know whether Wikidata implements SKOS as DBpedia does. – Gilles-Antoine Nys, May 16 '18 at 13:06
Thanks for that, I'm going to do some research on the SKOS standard. Thank you very much for giving me your time! – Jean-Pierre Coffe, May 16 '18 at 13:11
To avoid the OOME, either increase the heap size, if practical, or read the data into a persistent Apache Jena TDB database (which avoids reloading every time the program runs). – AndyS, May 16 '18 at 14:25
@AndyS I've just read about Apache Jena TDB and was wondering whether it was the technical solution to my issue. Now I think that might be it! Thanks for your help. – Jean-Pierre Coffe, May 16 '18 at 14:39

score 1 · Answer 1 · answered May 16 '18 at 12:23

Yes, you can run the query with Jena; it is exactly what Jena is designed to do. However, I would suggest importing the file into an RDF data store and then using Jena to run a SPARQL query against that store.

You may want to see my answer to a related question on SO where I give some references to RDF data stores.

Henriette Harmse

I am confused about the purpose of duplicating Wikidata, which is already openly available through its SPARQL endpoint. – Gilles-Antoine Nys May 16 '18 at 13:07
There can be various benefits to having a local copy: for example, better control over connectivity, access to query plans, index optimization for your specific queries, etc. – Henriette Harmse May 16 '18 at 13:23
Wikidata has over 50M pages, so you need a strong architecture. – Gilles-Antoine Nys May 16 '18 at 13:32

score 1 · Accepted Answer · answered May 16 '18 at 13:13

What you intend to do is called "Broader Concept".

It is formalised in SKOS (skos:broader). Here is the link to the documentation: SKOS

The definition of SKOS is:

Simple Knowledge Organization System (SKOS) is a common data model for sharing and linking knowledge organization systems via the Web.

For instance, the broader concept of a Tree is Plant, and Tree is the broader concept of Pine or Oak.
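
To make this concrete for the category dump: there, mediawiki:isInCategory appears to play the role of the "broader" link, so checking whether a category falls under "Science" means following those links upward. Here is a minimal Java sketch of that climb. It assumes the child-to-parent edges have already been pulled out of the Turtle file into a map, for example by querying a TDB store. The class name, the method names and the map layout are hypothetical, chosen only for illustration; nothing here is part of Jena or of the dump.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Walks "is in category" edges upward. All names are illustrative. */
final class CategoryAncestors {

    private CategoryAncestors() {}

    /**
     * Returns every category reachable from start by following parent
     * edges, in breadth-first order. The start category is not included.
     * The seen set also stops the walk from looping on cyclic graphs.
     */
    static Set<String> ancestors(Map<String, List<String>> parents, String start) {
        Set<String> seen = new LinkedHashSet<>();
        Deque<String> queue = new ArrayDeque<>();
        queue.add(start);
        while (!queue.isEmpty()) {
            String current = queue.remove();
            for (String parent : parents.getOrDefault(current, List.of())) {
                // Skip the start node itself and anything already visited.
                if (!parent.equals(start) && seen.add(parent)) {
                    queue.add(parent);
                }
            }
        }
        return seen;
    }

    /** True if root is the category itself or one of its ancestors. */
    static boolean isUnder(Map<String, List<String>> parents, String category, String root) {
        return category.equals(root) || ancestors(parents, category).contains(root);
    }
}
```

For example, with a map in which "Mammal taxonomy" has the parent "Taxonomy" and "Taxonomy" has the parent "Science" (the middle category is made up), isUnder(parents, "Mammal taxonomy", "Science") returns true. Keeping only these edges in memory is much lighter than a full RDF model, but whether it fits in the heap for the complete dump is something to measure rather than assume.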
