I have a large N-Quads file containing many statements spread across a large number of different graphs. The lines of the file look like this:

<http://voag.linkedmodel.org/voag#useGuidelines> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/1999/02/22-rdf-syntax-ns#Property> <http://voag.linkedmodel.org/schema/voag> .

The fourth element corresponds to the graph's URI.

I would like to parse this file and split the different graphs into new files or data structures, one object per graph, preferably with RDFLib. I really don't know how to tackle this problem, so any help would be appreciated.

Alexis Pister
  • I'd just go with Linux command-line utilities like awk, grep, etc., but if you really want to use rdflib, where exactly is the problem? The docs are online, and loading and processing graphs is explained there very well. Just load the file into a `ConjunctiveGraph` and you're basically done; from there you can do whatever you want with each graph from the N-Quads file. – UninformedUser May 07 '19 at 06:02
  • Well, I loaded my file into a `ConjunctiveGraph`, but I don't see how to split the different graphs afterwards. The rdflib documentation is not very explicit. – Alexis Pister May 07 '19 at 08:14
  • Nah, I don't agree with that statement: https://rdflib.readthedocs.io/en/stable/apidocs/rdflib.html#rdflib.graph.ConjunctiveGraph lists the method `contexts()`, which returns all contexts (a.k.a. named graphs), and the `triples()` method gives you access to all triples of a given context. – UninformedUser May 07 '19 at 08:33

1 Answer

If the lines for each graph are contiguous in the file (all quads with the same graph URI appear together), you can use `itertools.groupby` to parse each graph in turn:

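from itertools import groupby

import rdflib


def parse_nquads(lines):
    # groupby only merges adjacent lines, so each graph's quads must be contiguous.
    # get_quad_label (defined below) extracts the graph URI from a line.
    for group, quad_lines in groupby(lines, get_quad_label):
        graph = rdflib.Graph(identifier=group)
        graph.parse(data=''.join(quad_lines), format='nquads')
        yield graph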

If the fourth element is always present and is always a URI (the N-Quads specification guarantees neither: the graph label is optional and may be a blank node), you can extract it with a regular expression anchored at the end of the line.

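import re

# Last <...> term before the terminating dot; the dot is escaped and surrounding whitespace is optional.
RDF_QUAD_LABEL_RE = re.compile(r"[ \t]+<([^>]*)>[ \t]*\.[ \t]*\r?\n?$")

def get_quad_label(line):
    return RDF_QUAD_LABEL_RE.search(line).group(1)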
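As a quick check, here is a minimal sketch that applies the same idea to the sample line from the question. The helper name `quad_label_or_none` and the `example.org` line are made up for illustration; unlike the function above, it returns `None` instead of raising when no URI label is found at the end of the line.

```python
import re

# Same idea as above: the graph label is the last <...> term before the final dot.
LABEL_RE = re.compile(r"[ \t]+<([^>]*)>[ \t]*\.[ \t]*\r?\n?$")


def quad_label_or_none(line):
    """Return the URI at the end of an N-Quads line, or None if there is none.

    Caveat: on a plain triple whose object is a URI, this returns the object,
    because a regex on the line end cannot tell a triple from a quad.
    """
    match = LABEL_RE.search(line)
    return match.group(1) if match else None


# The sample line from the question.
sample = (
    "<http://voag.linkedmodel.org/voag#useGuidelines> "
    "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type> "
    "<http://www.w3.org/1999/02/22-rdf-syntax-ns#Property> "
    "<http://voag.linkedmodel.org/schema/voag> .\n"
)
print(quad_label_or_none(sample))

# Hypothetical line with a blank-node graph label: no URI label is found.
print(quad_label_or_none('<http://example.org/s> <http://example.org/p> "lit" _:g1 .\n'))

# Comment lines do not match either.
print(quad_label_or_none("# a comment line\n"))
```

Returning `None` lets the caller decide what to do with comment lines, blank lines, or blank-node labels rather than failing with an `AttributeError` halfway through a large file.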

Then you can process each graph from the input file into a new file or dataset:

with open('myfile.nquads', 'rt') as f:
    for graph in parse_nquads(f):
        ...
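If the quads of one graph are not contiguous, `groupby` yields several separate groups for the same URI. One possible workaround, sketched below under the assumption that the file fits in memory, is to bucket the lines by label first. The names `bucket_lines_by_label` and `naive_label`, and the `example.org` data, are hypothetical and only serve the illustration.

```python
from collections import defaultdict


def bucket_lines_by_label(lines, label_of):
    """Collect N-Quads lines per graph label, regardless of their order in the input."""
    buckets = defaultdict(list)
    for line in lines:
        if line.strip():  # skip blank lines
            buckets[label_of(line)].append(line)
    return buckets


def naive_label(line):
    # Hypothetical stand-in for a label extractor: takes the last
    # whitespace-separated term before the final dot. Assumes every line
    # is a quad whose graph label is a URI in angle brackets.
    return line.rstrip().rstrip(".").rstrip().rsplit(None, 1)[-1].strip("<>")


# Made-up input in which the two graphs are interleaved.
lines = [
    "<http://example.org/a> <http://example.org/p> <http://example.org/b> <http://example.org/g1> .\n",
    "<http://example.org/c> <http://example.org/p> <http://example.org/d> <http://example.org/g2> .\n",
    "<http://example.org/e> <http://example.org/p> <http://example.org/f> <http://example.org/g1> .\n",
]

buckets = bucket_lines_by_label(lines, naive_label)
for label, quad_lines in buckets.items():
    print(label, len(quad_lines))
```

Each bucket can then be joined with `''.join(quad_lines)` and parsed into its own graph, or written straight to a per-graph file. For files too large for memory, sorting the file by its fourth column beforehand and keeping the `groupby` approach is likely the better trade-off.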
Skeptric