
I'm trying to parse several big graphs with RDFLib 3.0. It apparently handles the first one and then dies on the second with a MemoryError. It looks like MySQL is no longer supported as a store, so can you suggest a way to parse these files?

Traceback (most recent call last):
  File "names.py", line 152, in <module>
    main()
  File "names.py", line 91, in main
    locals()[graphname].parse(filename, format="nt")
  File "/usr/local/lib/python2.6/dist-packages/rdflib-3.0.0-py2.6.egg/rdflib/graph.py", line 938, in parse
    location=location, file=file, data=data, **args)
  File "/usr/local/lib/python2.6/dist-packages/rdflib-3.0.0-py2.6.egg/rdflib/graph.py", line 757, in parse
    parser.parse(source, self, **args)
  File "/usr/local/lib/python2.6/dist-packages/rdflib-3.0.0-py2.6.egg/rdflib/plugins/parsers/nt.py", line 24, in parse
    parser.parse(f)
  File "/usr/local/lib/python2.6/dist-packages/rdflib-3.0.0-py2.6.egg/rdflib/plugins/parsers/ntriples.py", line 124, in parse
    self.line = self.readline()
  File "/usr/local/lib/python2.6/dist-packages/rdflib-3.0.0-py2.6.egg/rdflib/plugins/parsers/ntriples.py", line 151, in readline
    m = r_line.match(self.buffer)
MemoryError
Manuel Salvadores
user52028778

1 Answer


How many triples are in those RDF files? I have tested rdflib, and in my experience it doesn't scale much beyond a few tens of thousands of triples, if you are lucky. It does not perform well on files with millions of triples.
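To get a rough answer to the size question without building a graph at all, the file can be streamed line by line, since N-Triples puts one triple per line. This is only a minimal sketch, assuming the input really is N-Triples (as the `format="nt"` call in the traceback suggests). `count_triples` is a hypothetical helper name, and it skips blank and comment lines without validating the syntax:

```python
import io


def count_triples(path):
    """Approximate the number of triples in an N-Triples file.

    Reads one line at a time, so memory use stays flat regardless of
    file size. Blank lines and '#' comment lines are not counted.
    """
    count = 0
    with io.open(path, "r", encoding="utf-8") as handle:
        for line in handle:
            stripped = line.strip()
            if stripped and not stripped.startswith("#"):
                count += 1
    return count
```

Calling `count_triples("YOUR_FILE.ntriples")` on each file gives a number to compare against the scale limits mentioned above.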

The best parser out there is rapper, from the Redland libraries. My first advice is to avoid RDF/XML and use N-Triples, which is a lighter format. You can convert RDF/XML to N-Triples with rapper:

rapper -i rdfxml -o ntriples YOUR_FILE.rdf > YOUR_FILE.ntriples

If you like Python, you can use the Redland Python bindings:

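import RDF

# Parse an N-Triples file into an in-memory Redland model.
parser = RDF.Parser(name="ntriples")
model = RDF.Model()
# "file://file_path" and the base URI are placeholders; substitute your own.
parser.parse_into_model(model, "file://file_path",
                        "http://your_base_uri.org")
for triple in model:
    print triple.subject, triple.predicate, triple.object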

I have parsed fairly big files (a couple of gigabytes) with the Redland libraries with no problems.

Eventually, if you are handling big datasets, you may need to load your data into a scalable triple store. The one I normally use is 4store, which internally uses Redland to parse RDF files. In the long term, I think a scalable triple store is the way to go. With it you'll be able to use SPARQL to query your data and SPARQL/Update to insert and delete triples.

Manuel Salvadores
  • Thanks for the reply. I'm using ntriples, but I wanted to use alignments as well (it would be really cool to have confidence values on mappings; is that possible in ntriples?). I'm not sure about the number of entries, but each file is around 1 GB (8 files in total for now, but this can grow to 100). I'll probably start migrating to 4store + Redland now... – user52028778 Apr 15 '11 at 16:29
  • Alignments in ntriples? If they can be expressed in RDF, they can also be expressed in ntriples. And yes, for the number of files and sizes that you mention, definitely go for 4store. You'll find valuable help at http://groups.google.com/group/4store-support – Manuel Salvadores Apr 15 '11 at 16:33
  • 4store sounds a bit more complex than I thought; I just wanted to run it on my laptop for a student project I'm working on. There is a chance I'll consider only a subset of the triples. Do you know the maximum capabilities of Redland on its own, without 4store? – user52028778 Apr 19 '11 at 08:59
  • You could use Redland stores, with SQLite or SleepyCat as storage backends, but I have not tried them myself and cannot say how they scale. Anyway, I don't think a laptop will cope with that amount of data in any triple store. You might have to partition your data into different KBs. – Manuel Salvadores Apr 19 '11 at 09:21
  • @msalvadores could you take a look at [this question](http://stackoverflow.com/questions/42493215/parse-rdf-file-python)? – StuartDTO Feb 27 '17 at 19:12
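
The partitioning suggested in the comments above could be sketched roughly as follows. Because N-Triples is line-based, a large file can be cut into smaller files without parsing it. This is an illustration under stated assumptions, not a tested recipe for these particular datasets: `split_ntriples`, the `triples_per_chunk` default, and the `.partNNNN.nt` naming scheme are all hypothetical choices. One caveat is that blank node labels are scoped to a single file, so triples sharing a blank node may no longer link up if they land in different chunks.

```python
import io


def split_ntriples(path, triples_per_chunk=1000000):
    """Split an N-Triples file into numbered chunk files.

    Streams the input line by line and returns the list of chunk paths.
    Blank lines and '#' comment lines are dropped.
    """
    chunk_paths = []
    out = None
    written = 0
    with io.open(path, "r", encoding="utf-8") as source:
        for line in source:
            stripped = line.strip()
            if not stripped or stripped.startswith("#"):
                continue
            # Start a new chunk file when none is open or the current one is full.
            if out is None or written >= triples_per_chunk:
                if out is not None:
                    out.close()
                chunk_path = "%s.part%04d.nt" % (path, len(chunk_paths))
                out = io.open(chunk_path, "w", encoding="utf-8")
                chunk_paths.append(chunk_path)
                written = 0
            out.write(stripped + u"\n")
            written += 1
    if out is not None:
        out.close()
    return chunk_paths
```

Each resulting chunk can then be parsed or loaded on its own, which keeps the memory needed per parse bounded by the chunk size rather than by the full file.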