
I'm looking for a way to query an RDF file written in Turtle syntax. The file contains the whole Wikipedia category hierarchy, provided by Wikidata.

Here is an extract from the file enwiki-categories.ttl, showing the overall structure of the data:

@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix mediawiki: <https://www.mediawiki.org/ontology#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix schema: <http://schema.org/> .
@prefix cc: <http://creativecommons.org/ns#> .

<https://en.wikipedia.org/wiki/Category:1148_establishments_in_France> a mediawiki:Category ;
    rdfs:label "1148 establishments in France" ;
    mediawiki:pages "2"^^xsd:integer ;
    mediawiki:subcategories "0"^^xsd:integer .

<https://en.wikipedia.org/wiki/Category:1148_establishments_in_France> mediawiki:isInCategory <https://en.wikipedia.org/wiki/Category:1140s_establishments_in_France>,
        <https://en.wikipedia.org/wiki/Category:1148_establishments_by_country>,
        <https://en.wikipedia.org/wiki/Category:1148_establishments_in_Europe>,
        <https://en.wikipedia.org/wiki/Category:1148_in_France>,
        <https://en.wikipedia.org/wiki/Category:Establishments_in_France_by_year> .

My final goal is to retrieve all parent categories of a Wikipedia category by querying the Turtle file. Here is a very short Java example showing my issue:

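import org.apache.jena.atlas.logging.LogCtl;
import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;
LogCtl.setCmdLogging();
Model model = ModelFactory.createDefaultModel();
model.read("enwiki-categories.ttl");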

The Turtle file is well over 850 MB, and loading the model with the code above causes an out-of-memory error. I need a way to query the file without loading the full RDF dataset into memory.

--

Is there a way to do this using Apache Jena or another library?

If not, is there a faster way to retrieve all parent categories of a given Wikipedia category using local data?

  • Why do you need to store Wikidata locally when there is an efficient SPARQL endpoint? URL of the endpoint: https://query.wikidata.org/sparql?query={SPARQL} – Gilles-Antoine Nys May 16 '18 at 12:21
  • That's a good question: I have built a parser that successfully extracts the terms from all 5 million Wikipedia articles, but the resulting dataset is too big. I am now looking for a way to filter the retrieved data by parent category. For example, if I select the parent category "Science" and the parser finds the category "Mammal taxonomy" in an article, it should be able to climb back up the hierarchy tree to the root "Science" and conclude that the article must be selected. Using the API would add 200 ms of latency to each article read, so I can't do it that way. – Jean-Pierre Coffe May 16 '18 at 12:36
  • @Gilles-AntoineNys If you have any idea how to do this better, that would help me a ton. I'm a bit stuck right now! – Jean-Pierre Coffe May 16 '18 at 12:48
  • What you intend to do is called "Broader Concept". For instance, the broader concept of a Tree is Plant. And Tree is the broader concept of Pine or Oak... It is formalised in SKOS (skos:broader). – Gilles-Antoine Nys May 16 '18 at 13:05
  • The problem is I don't know whether Wikidata implements SKOS as DBpedia does. – Gilles-Antoine Nys May 16 '18 at 13:06
  • Thanks for that, I'm going to research this SKOS standard. Thank you very much for your time! – Jean-Pierre Coffe May 16 '18 at 13:11
  • To avoid the OOME, either increase the heap size, if practical, or read the data into a persistent Apache Jena TDB database (which avoids reloading every time the program runs). – AndyS May 16 '18 at 14:25
  • @AndyS I've just read about Apache Jena TDB and was wondering whether it was the technical solution to my issue. Now I think it might be! Thanks for your help. – Jean-Pierre Coffe May 16 '18 at 14:39

2 Answers


Yes, you can run this query with Jena. It is exactly what Jena is designed to do. I would, however, suggest that you import the file into an RDF data store and then use Jena to run a SPARQL query against that store.
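As a rough sketch of that approach, the code below loads the Turtle file once into an on-disk Jena TDB2 store and then asks for every ancestor of a category with a SPARQL property path (`mediawiki:isInCategory+`). It needs the Apache Jena libraries on the classpath. The directory name `tdb2-categories` and the class name `CategoryParents` are made up for illustration, TDB2 is only one possible store, and the example category is the one from the question.

```java
import org.apache.jena.query.Dataset;
import org.apache.jena.query.QueryExecution;
import org.apache.jena.query.QueryExecutionFactory;
import org.apache.jena.query.ResultSet;
import org.apache.jena.riot.RDFDataMgr;
import org.apache.jena.system.Txn;
import org.apache.jena.tdb2.TDB2Factory;

public class CategoryParents {
    public static void main(String[] args) {
        // Assumption: "tdb2-categories" is a hypothetical directory for the persistent store.
        Dataset dataset = TDB2Factory.connectDataset("tdb2-categories");

        // One-time load: parse the Turtle file into the store if it is still empty.
        Txn.executeWrite(dataset, () -> {
            if (dataset.getDefaultModel().isEmpty()) {
                RDFDataMgr.read(dataset.getDefaultModel(), "enwiki-categories.ttl");
            }
        });

        // "+" follows mediawiki:isInCategory one or more times, so the result
        // contains direct parents as well as all further ancestors.
        String query =
            "PREFIX mediawiki: <https://www.mediawiki.org/ontology#>\n" +
            "SELECT DISTINCT ?parent WHERE {\n" +
            "  <https://en.wikipedia.org/wiki/Category:1148_establishments_in_France>\n" +
            "      mediawiki:isInCategory+ ?parent .\n" +
            "}";

        // Run the query inside a read transaction and print each ancestor IRI.
        Txn.executeRead(dataset, () -> {
            try (QueryExecution qexec = QueryExecutionFactory.create(query, dataset)) {
                ResultSet results = qexec.execSelect();
                while (results.hasNext()) {
                    System.out.println(results.next().getResource("parent").getURI());
                }
            }
        });
    }
}
```

The load only has to happen once; later runs reuse the data on disk, so the file is never held in memory as a whole. For a file of this size, Jena's command-line bulk loader (`tdb2.tdbloader --loc tdb2-categories enwiki-categories.ttl`) is probably a better way to do the initial import than loading from application code.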

You may want to see my answer to a related question on SO where I give some references to RDF data stores.

Henriette Harmse

What you intend to do is called "Broader Concept".

It is formalised in SKOS (skos:broader). Here is the link to the documentation: SKOS

The definition of SKOS is:

Simple Knowledge Organization System (SKOS) is a common data model for sharing and linking knowledge organization systems via the Web.

For instance, the broader concept of Tree is Plant, and Tree is in turn the broader concept of Pine or Oak.
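Whatever the relation is called (skos:broader, or mediawiki:isInCategory in the file from the question), deciding whether a category falls under a root such as "Science" comes down to walking that relation upwards transitively. Here is a minimal sketch in plain Java, assuming the child-to-parents links have already been extracted into a map. The class and method names are hypothetical. The `seen` set guards against cycles, which real category graphs may contain.

```java
import java.util.ArrayDeque;
import java.util.Collections;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class CategoryAncestors {

    /**
     * Returns every category reachable from start by following parent links
     * one or more times (breadth-first). If start sits on a cycle, it can
     * appear in its own result.
     */
    public static Set<String> ancestors(Map<String, List<String>> parents, String start) {
        Set<String> seen = new LinkedHashSet<>();
        Deque<String> queue = new ArrayDeque<>();
        queue.add(start);
        while (!queue.isEmpty()) {
            String current = queue.poll();
            for (String parent : parents.getOrDefault(current, Collections.emptyList())) {
                // add() returns false for categories already visited, which stops cycles.
                if (seen.add(parent)) {
                    queue.add(parent);
                }
            }
        }
        return seen;
    }

    /** True if category is root itself or has root among its ancestors. */
    public static boolean isUnder(Map<String, List<String>> parents, String category, String root) {
        return category.equals(root) || ancestors(parents, category).contains(root);
    }
}
```

With a map in which "Mammal taxonomy" leads to "Science" through some chain of parents, `isUnder(parents, "Mammal taxonomy", "Science")` returns true. The map is an in-memory structure here, but the same walk can be done directly by a store that supports transitive queries.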

Gilles-Antoine Nys