I'm looking for a way to query an RDF file formatted in Turtle syntax. The RDF file is actually the whole Wikipedia category hierarchy, provided by Wikidata.
Here is an extract from the file enwiki-categories.ttl, showing the overall structure of the data:
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix mediawiki: <https://www.mediawiki.org/ontology#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix schema: <http://schema.org/> .
@prefix cc: <http://creativecommons.org/ns#> .
<https://en.wikipedia.org/wiki/Category:1148_establishments_in_France> a mediawiki:Category ;
rdfs:label "1148 establishments in France" ;
mediawiki:pages "2"^^xsd:integer ;
mediawiki:subcategories "0"^^xsd:integer .
<https://en.wikipedia.org/wiki/Category:1148_establishments_in_France> mediawiki:isInCategory <https://en.wikipedia.org/wiki/Category:1140s_establishments_in_France>,
<https://en.wikipedia.org/wiki/Category:1148_establishments_by_country>,
<https://en.wikipedia.org/wiki/Category:1148_establishments_in_Europe>,
<https://en.wikipedia.org/wiki/Category:1148_in_France>,
<https://en.wikipedia.org/wiki/Category:Establishments_in_France_by_year> .
My final goal is to retrieve all parent categories of a given Wikipedia category by querying the RDF Turtle file. Here is a very short Java example showing my issue:
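import org.apache.jena.atlas.logging.LogCtl;
import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;
LogCtl.setCmdLogging();
Model model = ModelFactory.createDefaultModel();
model.read("enwiki-categories.ttl"); // runs out of memory on the full file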
The RDF Turtle file is well over 850 MB, so loading the model with the code shown above causes an out-of-memory error. I need a way to query the RDF file without having to load the full RDF database into memory.
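To make concrete what I mean by "without loading everything", here is a rough sketch of a plain line-by-line scan that never builds a Model. It is not a real Turtle parser: it assumes the whole dump keeps the exact layout of the extract above (the subject and mediawiki:isInCategory on one line separated by a single space, then one parent IRI per line, ending with " ."). The class name ParentCategoryScan and the method parentsOf are names I made up for illustration, and it only returns the direct parents of one category.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: scans a Turtle dump line by line instead of loading it.
public class ParentCategoryScan {
    // Matches anything written as <...>, i.e. an IRI in the dump.
    private static final Pattern IRI = Pattern.compile("<([^>]*)>");
    private static final String PREDICATE = "mediawiki:isInCategory";

    // Returns the direct parent categories of categoryIri, in file order.
    // Assumption: the statement starts with "<subject> mediawiki:isInCategory".
    static List<String> parentsOf(BufferedReader reader, String categoryIri) {
        List<String> parents = new ArrayList<>();
        String statementStart = "<" + categoryIri + "> " + PREDICATE;
        boolean inStatement = false;
        Iterator<String> lines = reader.lines().iterator();
        while (lines.hasNext()) {
            String line = lines.next().trim();
            if (!inStatement) {
                if (!line.startsWith(statementStart)) {
                    continue; // unrelated line, skip it
                }
                inStatement = true;
                // Drop the subject and predicate so only objects remain.
                line = line.substring(statementStart.length());
            }
            Matcher m = IRI.matcher(line);
            while (m.find()) {
                parents.add(m.group(1));
            }
            // "." or ";" closes the object list for this predicate.
            if (line.endsWith(".") || line.endsWith(";")) {
                inStatement = false;
            }
        }
        return parents;
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.err.println("usage: ParentCategoryScan <file.ttl> <category IRI>");
            return;
        }
        try (BufferedReader reader =
                 Files.newBufferedReader(Paths.get(args[0]), StandardCharsets.UTF_8)) {
            parentsOf(reader, args[1]).forEach(System.out::println);
        }
    }
}
```

This keeps memory use flat, but every lookup rescans the whole file, and collecting all ancestors would mean repeating the scan for each parent found, so I doubt it is a real solution.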
--
Is there a way to do this using Apache Jena or another library?
If not, is there a faster way to retrieve all parent categories of a given Wikipedia category using local data?