
I've got a dataset with 220 million triples in one TTL file. Is there a way I can upload this data into AnzoGraph?

In the AnzoGraph documentation, https://docs.cambridgesemantics.com/anzograph/userdoc/load-reqs.htm, I came across the text below:

AnzoGraph supports a maximum URI length of 16K characters. There is also a limit of 64K on the number of unique URIs you can load into AnzoGraph. That is, the number of unique URIs, including graph URIs and predicate URIs, that you can load into AnzoGraph must be less than 64K. If you exceed this limit, the Load operation exceeding the limit will fail and AnzoGraph returns the message "m_lowest_unused_index <= a_max_value()".

With a 64K limit on unique URIs, I'm expecting the upload of 220 million triples to fail, especially since it's a linking dataset connecting multiple sources, so there are lots of unique URIs.

Is there a way around this limitation?

janw
    Given that AnzoGraph advertises itself as "massively scalable" I think this must be a typo of some sort. I mean, 64K unique URIs is nothing. – Jeen Broekstra Jul 31 '20 at 09:12
  • Personally I've never seen a limitation like this on any triple-store. So if they decide to mention this warning, it's likely to be correct. About the massively scalable abilities of AnzoGraph, I interpreted the limitation as a per upload limitation. Meaning one can do as many uploads as he or she wants as long as each of these uploads doesn't break the 64k limit. – Richard Nagelmaeker Aug 01 '20 at 10:14
  • Appears the text needs to be updated and corrected. The 64K is the sum of the number of distinct predicate and graph URIs. It doesn't apply to all URIs. Also, the 64K limit is not per load; it is per running instance. – Wayne W Aug 04 '20 at 20:40
  • Ahh, clear. Luckily the dataset is one graph, so this shouldn't be a problem. – Richard Nagelmaeker Aug 05 '20 at 21:09
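Following up on the comments: since the 64K limit applies to distinct predicate and graph URIs, you can sanity-check how close a loaded dataset is to the limit with a plain SPARQL count (a sketch; the `GRAPH ?g` pattern assumes your data sits in named graphs):

```sparql
# Count the distinct predicate URIs and graph URIs across all named graphs
SELECT (COUNT(DISTINCT ?p) AS ?predicates) (COUNT(DISTINCT ?g) AS ?graphs)
WHERE { GRAPH ?g { ?s ?p ?o } }
```

The sum of the two counts is what should stay under 64K for the running instance.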

1 Answer


220 million triples, in one TTL file.

This approach will load your TTL data very slowly, because you will engage just a single CPU core to ingest the data. Instead, load the data once into a graph, e.g. <yourgraph>, then use the command

`COPY <yourgraph> TO <dir:/mydir/myfiles.ttl.gz>`

which will split your dataset into many gzip-compressed TTL files. The next time, load the data MPP-style from that data directory instead, using every CPU core in your AnzoGraph server/cluster to load subsets of the data in parallel.

I should also note that 220 million triples is actually a very small dataset for AnzoGraph. I have loaded over 100m triples on my T470s ThinkPad while just fiddling around; single server-class systems will easily handle billions, and a large cluster was tested to over a trillion triples with a record-breaking LUBM benchmark some years ago. Typical production use cases are in the tens of billions.
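For reference, the round trip might look like this (graph and directory names are placeholders, and the exact `LOAD` syntax is an assumption; check the load documentation for your AnzoGraph version):

```sparql
# One-time: export the already-loaded graph as many gzipped TTL files
COPY <yourgraph> TO <dir:/mydir/myfiles.ttl.gz>

# Subsequent loads: read the whole directory, one file per core, in parallel
LOAD <dir:/mydir/myfiles.ttl.gz> INTO GRAPH <yourgraph>
```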

Disclaimer: I work for Cambridge Semantics.

Sean Martin