3

I would like to load Wikidata into Virtuoso. After days of searching I was unable to find a tutorial, or even anyone who uses Virtuoso to run SPARQL queries over Wikidata. I would not like to spend money on a server to load 50 GB+ of data for nothing. Why Virtuoso and not Blazegraph, for instance? Because I'm used to using Virtuoso for DBpedia.

FranMercaes
  • You load Wikidata into Virtuoso as you would any other RDF dataset. Or what exactly is the question here? I mean, how to load data is clearly covered in the Virtuoso docs: http://vos.openlinksw.com/owiki/wiki/VOS/VirtBulkRDFLoader - or do you have any other question? – UninformedUser Jun 26 '19 at 08:37
  • Actually, I am suffering with this now, since I am trying to load `latest-truthy.nt.bz2` and I am getting this error: `File latest-truthy.nt error 42000 RDFGE: RDF box with a geometry RDF type and a non-geometry content` I think I will create a separate question for it – Mohamed Gad-Elrab Jun 26 '19 at 08:48
  • This repo may be a good start: https://github.com/patrickhoefler/wikidata-virtuoso – Mohamed Gad-Elrab Jun 26 '19 at 08:52
  • The repo noted above by @MohamedGad-Elrab is LONG outdated. I suggest you follow up to [this related issue on the Virtuoso project](https://github.com/openlink/virtuoso-opensource/issues/295), or [create your own new issue there](https://github.com/openlink/virtuoso-opensource/issues/). You could also post to the [OpenLink Community Forum](http://community.openlinksw.com/). – TallTed Jun 26 '19 at 20:37

2 Answers

1

Better late than never, here's a guide that covers the creation and deployment of a Wikidata instance using Virtuoso.

-3

As indicated here and elsewhere, loading Wikidata into Virtuoso should simply be a matter of creating a Turtle file (better, multiple Turtle files) from the download and bulk loading it. To get decent performance, a number of parameters have to be changed in virtuoso.ini.
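The virtuoso.ini parameters that typically need raising are NumberOfBuffers and MaxDirtyBuffers; the stock file ships with commented presets for various amounts of RAM. For concreteness, here is a minimal sketch of the bulk-load sequence driven from Python. The data directory, graph IRI, port, and dba credentials are placeholders, and the client binary may be named isql or isql-v depending on the install.

    import subprocess

    # Placeholders: adjust host:port, credentials, data directory, and graph IRI.
    ISQL = ["isql", "1111", "dba", "dba"]   # client may be "isql-v" on some installs

    def run_sql(statement: str) -> None:
        """Execute one SQL statement through Virtuoso's isql client."""
        subprocess.run(ISQL + [f"exec={statement}"], check=True)

    # Register every matching file with the bulk loader, run the loader, then
    # checkpoint so the loaded data is made durable (see the VirtBulkRDFLoader docs).
    run_sql("ld_dir('/data/wikidata', '*.ttl', 'http://www.wikidata.org/');")
    run_sql("rdf_loader_run();")
    run_sql("checkpoint;")

With the dump split into many files, the bulk-loader documentation suggests starting several rdf_loader_run() calls in parallel from separate isql sessions to make use of more cores.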

There is a problem loading Wikidata into Virtuoso, however, due to a long-standing bug in Virtuoso's implementation of geo-coordinates. Getting around it requires patching Virtuoso and is not for the faint of heart.

Here are the instructions on what to do to get the open-source version of Virtuoso to load Wikidata. Note that the patching of the geo-coordinate code might cause problems when using the resulting KB.

1/ Patch the geo-coordinate literal code, editing virtuoso-opensource/libsrc/Wi/rdfbox.c to comment out two pieces of code that check for non-terrestrial coordinates. Note that this is a bug in Virtuoso and that Wikidata conforms to the specification of this datatype.

    /* non-terrestrial coordinates
    if (RDF_BOX_GEO_TYPE == type && DV_GEO != box_dtp && DV_LONG_INT != box_dtp)
      sqlr_new_error ("42000", "RDFGE", "RDF box with a geometry RDF type and a non-geometry content");
    */

    /* non-terrestrial coordinates
    if (type == RDF_BOX_GEO && box_dtp != DV_GEO)
      sqlr_new_error ("22023", "SR559", "The RDF box of type geometry needs a spatial object as a value, not a value of type %s (%d)", dv_type_title (box_dtp), box_dtp);
    */

2/ Patch the Turtle loader, editing the end of rdf_rl_lang_id in virtuoso-opensource/libsrc/Wi/ttlpv.sql to look as follows. Note that this is another bug in Virtuoso that is triggered by parallel loading of langstrings with different languages.

    id := sequence_next ('RDF_LANGUAGE_TWOBYTE', 1, 1);
    --pfps insert into rdf_language (rl_twobyte, rl_id) values (id, ln);
    insert soft rdf_language (rl_twobyte, rl_id) values (id, ln);
    commit work; -- if load non transactional, this is still a sharp transaction boundary.
    log_enable (old_mode, 1);
    --pfps get the actual id, as it may be different
    id := (select RL_TWOBYTE from DB.DBA.RDF_LANGUAGE where RL_ID = ln);
    rdf_cache_id ('l', ln, id);
    return id;

  • Could you please be more specific in how it can be made working? – Wolfgang Fahl May 09 '20 at 15:26
  • Note that the speed of loading Wikidata into Virtuoso is much improved if you split the Wikidata dump into multiple pieces (ideally hundreds). This also is not trivial; one way to do the split is sketched after these comments. – Peter F. Patel-Schneider May 10 '20 at 16:38
  • Also, if you are loading the complete dump (with over 10 billion triples) you need a machine with at least 256GB of main memory (maybe even at least 512GB) and an SSD to store the DB. A machine with less memory will start thrashing part way through the load and progress will be *very* slow thereafter. – Peter F. Patel-Schneider May 10 '20 at 16:41
  • http://wiki.bitplan.com/index.php/Get_your_own_copy_of_WikiData#Test_with_Apache_Jena shows my current attempt with a different SPARQL store. I'll test Blazegraph and Virtuoso again after that. The official Blazegraph import also splits the triples. Still, I'd appreciate detailed documentation of the process to be included in the wiki mentioned above. – Wolfgang Fahl May 10 '20 at 17:34
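For what it's worth, here is a minimal sketch of one way to do the split mentioned in the comments above, assuming the line-oriented latest-truthy.nt.bz2 N-Triples dump (one complete triple per line, so a plain line split is safe; this would not work for prefixed Turtle). File names and chunk size are arbitrary placeholders.

    import bz2
    import gzip

    DUMP = "latest-truthy.nt.bz2"     # N-Triples dump: one complete triple per line
    LINES_PER_CHUNK = 10_000_000      # arbitrary; pick a size that yields hundreds of files

    def split_dump() -> None:
        """Stream the bz2 dump and rewrite it as numbered gzipped chunks."""
        chunk, written, out = 0, 0, None
        with bz2.open(DUMP, "rt", encoding="utf-8") as src:
            for line in src:
                if out is None or written == LINES_PER_CHUNK:
                    if out is not None:
                        out.close()
                    out = gzip.open(f"wikidata-part-{chunk:04d}.nt.gz", "wt", encoding="utf-8")
                    chunk += 1
                    written = 0
                out.write(line)
                written += 1
        if out is not None:
            out.close()

    if __name__ == "__main__":
        split_dump()

Writing the chunks gzipped keeps disk usage manageable, and as far as I know the Virtuoso bulk loader can register gzipped files directly.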