I'm a Haskell beginner. I'd like to parse Project Gutenberg RDF metadata in Haskell, but I don't really know where to start. The metadata is RDF/XML, so it's basically XML with some namespace conventions layered on top. I saw that there's the rdf4h library, but it looks complicated. There are a bunch of XML parsing libraries, but most use arrows and other things that I don't understand, and they look needlessly complicated. There are other libraries like xml, but I can't find any tutorials for them, and the documentation seems to consist only of code comments. Has anyone tackled a problem like this, and is there a solution I haven't considered? Thanks in advance!

Edit: here's an example:

<?xml version="1.0" encoding="utf-8"?>
<rdf:RDF xml:base="http://www.gutenberg.org/"
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns:dcterms="http://purl.org/dc/terms/"
  xmlns:cc="http://web.resource.org/cc/"
  xmlns:dcam="http://purl.org/dc/dcam/"
  xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
  xmlns:pgterms="http://www.gutenberg.org/2009/pgterms/"
>
  <cc:Work rdf:about="">
    <rdfs:comment>Archives containing the RDF files for *all* our books can be downloaded at
            http://www.gutenberg.org/wiki/Gutenberg:Feeds#The_Complete_Project_Gutenberg_Catalog</rdfs:comment>
    <cc:license rdf:resource="https://creativecommons.org/publicdomain/zero/1.0/"/>
  </cc:Work>
  <pgterms:ebook rdf:about="ebooks/20">
    <pgterms:bookshelf>
      <rdf:Description rdf:nodeID="N3f8445072d8e4499b2646626f94866e0">
        <rdf:value>Poetry</rdf:value>
        <dcam:memberOf rdf:resource="2009/pgterms/Bookshelf"/>
      </rdf:Description>
    </pgterms:bookshelf>
    <dcterms:hasFormat>
      <pgterms:file rdf:about="http://www.gutenberg.org/ebooks/20.rdf">
        <dcterms:modified rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">2017-03-16T05:01:13.615047</dcterms:modified>
        <dcterms:extent rdf:datatype="http://www.w3.org/2001/XMLSchema#integer">12133</dcterms:extent>
        <dcterms:isFormatOf rdf:resource="ebooks/20"/>
        <dcterms:format>
          <rdf:Description rdf:nodeID="N735ba077c8424051b6470a92682aaa5e">
            <dcam:memberOf rdf:resource="http://purl.org/dc/terms/IMT"/>
            <rdf:value rdf:datatype="http://purl.org/dc/terms/IMT">application/rdf+xml</rdf:value>
          </rdf:Description>
        </dcterms:format>
      </pgterms:file>
    </dcterms:hasFormat>
    <dcterms:issued rdf:datatype="http://www.w3.org/2001/XMLSchema#date">1991-10-01</dcterms:issued>
    <dcterms:title>Paradise Lost</dcterms:title>
    <dcterms:subject>
      <rdf:Description rdf:nodeID="Ne259525c666c4886a996acbdddca0682">
        <rdf:value>PR</rdf:value>
        <dcam:memberOf rdf:resource="http://purl.org/dc/terms/LCC"/>
      </rdf:Description>
    </dcterms:subject>
    <dcterms:hasFormat>
      <pgterms:file rdf:about="http://www.gutenberg.org/files/20/20.txt">
        <dcterms:extent rdf:datatype="http://www.w3.org/2001/XMLSchema#integer">507133</dcterms:extent>
        <dcterms:isFormatOf rdf:resource="ebooks/20"/>
        <dcterms:modified rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">2011-03-02T06:33:54</dcterms:modified>
        <dcterms:format>
          <rdf:Description rdf:nodeID="Nbd1740a2927845058b0fe43326dcc48b">
            <dcam:memberOf rdf:resource="http://purl.org/dc/terms/IMT"/>
            <rdf:value rdf:datatype="http://purl.org/dc/terms/IMT">text/plain; charset=us-ascii</rdf:value>
          </rdf:Description>
        </dcterms:format>
      </pgterms:file>
    </dcterms:hasFormat>
    <dcterms:hasFormat>
      <pgterms:file rdf:about="http://www.gutenberg.org/ebooks/20.epub.images">
        <dcterms:isFormatOf rdf:resource="ebooks/20"/>
        <dcterms:format>
          <rdf:Description rdf:nodeID="Nb08f3d2980e64e91a402eb5b205c10bc">
            <dcam:memberOf rdf:resource="http://purl.org/dc/terms/IMT"/>
            <rdf:value rdf:datatype="http://purl.org/dc/terms/IMT">application/epub+zip</rdf:value>
          </rdf:Description>
        </dcterms:format>
        <dcterms:extent rdf:datatype="http://www.w3.org/2001/XMLSchema#integer">232622</dcterms:extent>
        <dcterms:modified rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">2017-03-01T01:04:17.425321</dcterms:modified>
      </pgterms:file>
    </dcterms:hasFormat>
    <dcterms:hasFormat>
      <pgterms:file rdf:about="http://www.gutenberg.org/ebooks/20.kindle.images">
        <dcterms:isFormatOf rdf:resource="ebooks/20"/>
        <dcterms:extent rdf:datatype="http://www.w3.org/2001/XMLSchema#integer">933970</dcterms:extent>
        <dcterms:format>
          <rdf:Description rdf:nodeID="Nff1df57b9552466d96b114f20424b5a2">
            <dcam:memberOf rdf:resource="http://purl.org/dc/terms/IMT"/>
            <rdf:value rdf:datatype="http://purl.org/dc/terms/IMT">application/x-mobipocket-ebook</rdf:value>
          </rdf:Description>
        </dcterms:format>
        <dcterms:modified rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">2017-03-01T01:04:21.321235</dcterms:modified>
      </pgterms:file>
    </dcterms:hasFormat>
    <dcterms:language>
      <rdf:Description rdf:nodeID="N91273d0bffc74be393cda307d2b05137">
        <rdf:value rdf:datatype="http://purl.org/dc/terms/RFC4646">en</rdf:value>
      </rdf:Description>
    </dcterms:language>
    <dcterms:subject>
      <rdf:Description rdf:nodeID="N5e35fb378b37483ca6ef7a08f27cf936">
        <dcam:memberOf rdf:resource="http://purl.org/dc/terms/LCSH"/>
        <rdf:value>Eve (Biblical figure) -- Poetry</rdf:value>
      </rdf:Description>
    </dcterms:subject>
    <dcterms:license rdf:resource="license"/>
    <dcterms:hasFormat>
      <pgterms:file rdf:about="http://www.gutenberg.org/ebooks/20.html.images">
        <dcterms:extent rdf:datatype="http://www.w3.org/2001/XMLSchema#integer">614618</dcterms:extent>
        <dcterms:modified rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">2017-03-01T01:04:16.685338</dcterms:modified>
        <dcterms:isFormatOf rdf:resource="ebooks/20"/>
        <dcterms:format>
          <rdf:Description rdf:nodeID="N7567260ec2fd48c0be3d2858e08ac35d">
            <dcam:memberOf rdf:resource="http://purl.org/dc/terms/IMT"/>
            <rdf:value rdf:datatype="http://purl.org/dc/terms/IMT">text/html</rdf:value>
          </rdf:Description>
        </dcterms:format>
      </pgterms:file>
    </dcterms:hasFormat>
    <dcterms:hasFormat>
      <pgterms:file rdf:about="http://www.gutenberg.org/ebooks/20.epub.noimages">
        <dcterms:isFormatOf rdf:resource="ebooks/20"/>
        <dcterms:modified rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">2017-03-01T01:04:17.695324</dcterms:modified>
        <dcterms:extent rdf:datatype="http://www.w3.org/2001/XMLSchema#integer">232623</dcterms:extent>
        <dcterms:format>
          <rdf:Description rdf:nodeID="Nb640302bc2a84a31b0e154318df817d1">
            <rdf:value rdf:datatype="http://purl.org/dc/terms/IMT">application/epub+zip</rdf:value>
            <dcam:memberOf rdf:resource="http://purl.org/dc/terms/IMT"/>
          </rdf:Description>
        </dcterms:format>
      </pgterms:file>
    </dcterms:hasFormat>
    <dcterms:hasFormat>
      <pgterms:file rdf:about="http://www.gutenberg.org/ebooks/20.kindle.noimages">
        <dcterms:isFormatOf rdf:resource="ebooks/20"/>
        <dcterms:extent rdf:datatype="http://www.w3.org/2001/XMLSchema#integer">933967</dcterms:extent>
        <dcterms:modified rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">2017-03-01T01:04:24.846165</dcterms:modified>
        <dcterms:format>
          <rdf:Description rdf:nodeID="N1857bba1f5484e3d84846e1a554ec593">
            <dcam:memberOf rdf:resource="http://purl.org/dc/terms/IMT"/>
            <rdf:value rdf:datatype="http://purl.org/dc/terms/IMT">application/x-mobipocket-ebook</rdf:value>
          </rdf:Description>
        </dcterms:format>
      </pgterms:file>
    </dcterms:hasFormat>
    <dcterms:publisher>Project Gutenberg</dcterms:publisher>
    <dcterms:rights>Public domain in the USA.</dcterms:rights>
    <dcterms:creator>
      <pgterms:agent rdf:about="2009/agents/17">
        <pgterms:deathdate rdf:datatype="http://www.w3.org/2001/XMLSchema#integer">1674</pgterms:deathdate>
        <pgterms:webpage rdf:resource="http://en.wikipedia.org/wiki/John_Milton"/>
        <pgterms:name>Milton, John</pgterms:name>
        <pgterms:birthdate rdf:datatype="http://www.w3.org/2001/XMLSchema#integer">1608</pgterms:birthdate>
      </pgterms:agent>
    </dcterms:creator>
    <dcterms:type>
      <rdf:Description rdf:nodeID="N0f6e6d76b1ff4ea9a2c5c37949efe82b">
        <dcam:memberOf rdf:resource="http://purl.org/dc/terms/DCMIType"/>
        <rdf:value>Text</rdf:value>
      </rdf:Description>
    </dcterms:type>
    <dcterms:subject>
      <rdf:Description rdf:nodeID="N202624c4b5994d39a3ab8bf0a2a31d95">
        <dcam:memberOf rdf:resource="http://purl.org/dc/terms/LCSH"/>
        <rdf:value>Adam (Biblical figure) -- Poetry</rdf:value>
      </rdf:Description>
    </dcterms:subject>
    <dcterms:hasFormat>
      <pgterms:file rdf:about="http://www.gutenberg.org/ebooks/20.html.noimages">
        <dcterms:extent rdf:datatype="http://www.w3.org/2001/XMLSchema#integer">614618</dcterms:extent>
        <dcterms:format>
          <rdf:Description rdf:nodeID="N79f919d14da448e19eb05c444322ddd2">
            <dcam:memberOf rdf:resource="http://purl.org/dc/terms/IMT"/>
            <rdf:value rdf:datatype="http://purl.org/dc/terms/IMT">text/html</rdf:value>
          </rdf:Description>
        </dcterms:format>
        <dcterms:isFormatOf rdf:resource="ebooks/20"/>
        <dcterms:modified rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">2017-03-01T01:04:16.955332</dcterms:modified>
      </pgterms:file>
    </dcterms:hasFormat>
    <pgterms:bookshelf>
      <rdf:Description rdf:nodeID="Nec598f664c934ed49ba3c0168ef09615">
        <rdf:value>Banned Books from Anne Haight's list</rdf:value>
        <dcam:memberOf rdf:resource="2009/pgterms/Bookshelf"/>
      </rdf:Description>
    </pgterms:bookshelf>
    <dcterms:hasFormat>
      <pgterms:file rdf:about="http://www.gutenberg.org/ebooks/20.txt.utf-8">
        <dcterms:extent rdf:datatype="http://www.w3.org/2001/XMLSchema#integer">507105</dcterms:extent>
        <dcterms:format>
          <rdf:Description rdf:nodeID="N069b84f8b10844e9a6c713f4c163880b">
            <rdf:value rdf:datatype="http://purl.org/dc/terms/IMT">text/plain</rdf:value>
            <dcam:memberOf rdf:resource="http://purl.org/dc/terms/IMT"/>
          </rdf:Description>
        </dcterms:format>
        <dcterms:isFormatOf rdf:resource="ebooks/20"/>
        <dcterms:modified rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">2017-03-01T01:04:15.953358</dcterms:modified>
      </pgterms:file>
    </dcterms:hasFormat>
    <dcterms:subject>
      <rdf:Description rdf:nodeID="Nb489692851fa496d96b1a7fdf7a71b21">
        <dcam:memberOf rdf:resource="http://purl.org/dc/terms/LCSH"/>
        <rdf:value>Fall of man -- Poetry</rdf:value>
      </rdf:Description>
    </dcterms:subject>
    <pgterms:downloads rdf:datatype="http://www.w3.org/2001/XMLSchema#integer">2088</pgterms:downloads>
    <dcterms:subject>
      <rdf:Description rdf:nodeID="Naa6849a7660b4039baadec8af58f0c58">
        <dcam:memberOf rdf:resource="http://purl.org/dc/terms/LCSH"/>
        <rdf:value>Bible. Genesis -- History of Biblical events -- Poetry</rdf:value>
      </rdf:Description>
    </dcterms:subject>
    <dcterms:hasFormat>
      <pgterms:file rdf:about="http://www.gutenberg.org/files/20/20.zip">
        <dcterms:extent rdf:datatype="http://www.w3.org/2001/XMLSchema#integer">205748</dcterms:extent>
        <dcterms:format>
          <rdf:Description rdf:nodeID="N19cf968278bc4922bd87b17209c20d94">
            <dcam:memberOf rdf:resource="http://purl.org/dc/terms/IMT"/>
            <rdf:value rdf:datatype="http://purl.org/dc/terms/IMT">text/plain; charset=us-ascii</rdf:value>
          </rdf:Description>
        </dcterms:format>
        <dcterms:isFormatOf rdf:resource="ebooks/20"/>
        <dcterms:modified rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">2011-03-02T06:34:42</dcterms:modified>
        <dcterms:format>
          <rdf:Description rdf:nodeID="N94c2881f340a49c18246b69af3abcf12">
            <rdf:value rdf:datatype="http://purl.org/dc/terms/IMT">application/zip</rdf:value>
            <dcam:memberOf rdf:resource="http://purl.org/dc/terms/IMT"/>
          </rdf:Description>
        </dcterms:format>
      </pgterms:file>
    </dcterms:hasFormat>
  </pgterms:ebook>
  <rdf:Description rdf:about="http://en.wikipedia.org/wiki/John_Milton">
    <dcterms:description>Wikipedia</dcterms:description>
  </rdf:Description>
</rdf:RDF>
Jonathan
  • 1
    Have you looked at xml-conduit? It doesn't require knowledge of any complicated abstractions or advanced type system features. – Marcelo Zabani Mar 15 '18 at 16:30
  • 2
    `rdf4h` doesn't look that complex either. You just need to call `parseString`/`parseFile` with an `XmlParser` argument. – arrowd Mar 15 '18 at 16:34
  • Please avoid XML parsing. – Stanislav Kralin Mar 15 '18 at 17:01
  • @StanislavKralin I presume you mean XML parsers aren't good and/or convenient tools to correctly parse RDF metadata? (I'm asking because I'm not at all familiar with the RDF format.) – duplode Mar 15 '18 at 17:13
  • 1
    [Your Python fellow's opinion](https://stackoverflow.com/questions/45203061/how-to-limit-the-scope-of-element-extraction-between-the-start-and-end-tag-of-a#comment78478172_45203061). Try to find an RDF library for Haskell. Also, RDF can be serialized in different ways (and, conversely, different serializations can be equivalent in the abstract RDF syntax). – Stanislav Kralin Mar 15 '18 at 17:25
  • For the sake of completeness, links to documentation for [*xml-conduit*](https://www.stackage.org/lts-11.0/package/xml-conduit-1.8.0) and [*hxt*](https://hackage.haskell.org/package/hxt-9.3.1.16) (the latter presumably being the XML library with an arrow-based interface you allude to). – duplode Mar 15 '18 at 17:32
  • 1
    Or, better, load the RDF into a local SPARQL endpoint and then query it with SPARQL (using the endpoint GUI or `hsparql`). Unfortunately, it seems that `rdf4h` can't perform SPARQL queries. – Stanislav Kralin Mar 15 '18 at 17:38
  • @arrowd, How would I go about parsing XML like the above using rdf4h? And what kinds of data structures would that produce? I can't seem to find any good tutorials around. – Jonathan Mar 15 '18 at 21:42
  • @StanislavKralin, how would I go about doing that? – Jonathan Mar 15 '18 at 22:23
  • @Jono Into a triples list for instance: https://github.com/robstewart57/rdf4h/blob/master/examples/ParseURLs.hs – arrowd Mar 16 '18 at 06:25
  • @Jono, possibly [tag:fuseki] is the best SPARQL endpoint to start with. Unfortunately, there are no SPARQL endpoints or triplestores with SPARQL support written in Haskell. – Stanislav Kralin Mar 16 '18 at 06:37
  • 1
    @StanislavKralin Why not simply use an HTTP GET request and parse e.g. JSON resultset? Shouldn't be that difficult, though, indeed some more lines of code. – UninformedUser Mar 16 '18 at 07:11
  • @AKSW, [their endpoint](http://wifo5-03.informatik.uni-mannheim.de/gutendata/sparql) is not available (there is a dump only), or I can't understand you. – Stanislav Kralin Mar 16 '18 at 07:32
  • 1
    Well, I meant just what you said. Load the dump into a triple store, e.g. Fuseki and query via HTTP. Better than parsing XML data which encodes RDF... – UninformedUser Mar 16 '18 at 07:40
  • @AKSW, OK. Haskellers also have the `hsparql` package, in which SPARQL queries look similar to SPARQL algebra :-). – Stanislav Kralin Mar 16 '18 at 08:01
  • @arrowd, `TriplesList` doesn't seem to exist in Data.RDF. Either I just can't find it or that example is out of date. – Jonathan Mar 23 '18 at 01:39
  • @Jono Then use the example provided by the package itself: https://hackage.haskell.org/package/rdf4h-3.0.1/src/examples/ParseURLs.hs – arrowd Mar 23 '18 at 06:04
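
For reference, the xml-conduit suggestion from the first comment could be sketched roughly as follows. This is a minimal sketch, not a tested solution: it treats the file as plain XML, which is exactly what other commenters caution against, so it would break on any equivalent RDF serialization with a different layout. The `Book` record, the `qname` helper, and the choice of fields are hypothetical names made up for illustration; the namespace URIs are copied from the `xmlns` declarations in the sample above.

```haskell
{-# LANGUAGE OverloadedStrings #-}
module Main (main) where

import           Data.Text (Text)
import           System.Environment (getArgs)
import qualified Text.XML as X
import           Text.XML.Cursor (attribute, content, element, fromDocument,
                                  ($/), ($//), (&/))

-- | The handful of fields this sketch extracts (hypothetical record,
-- not part of any library).
data Book = Book
  { bookAbout    :: [Text]  -- value(s) of rdf:about on pgterms:ebook
  , bookTitles   :: [Text]
  , bookAuthors  :: [Text]
  , bookSubjects :: [Text]
  } deriving (Show, Eq)

-- Namespace URIs as declared on the rdf:RDF root element of the sample.
rdfNs, dctermsNs, pgtermsNs :: Text
rdfNs     = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
dctermsNs = "http://purl.org/dc/terms/"
pgtermsNs = "http://www.gutenberg.org/2009/pgterms/"

-- | Build a namespace-qualified name (hypothetical helper). The prefix is
-- left empty; names are compared by namespace and local name.
qname :: Text -> Text -> X.Name
qname ns local = X.Name local (Just ns) Nothing

-- | Collect one Book per pgterms:ebook element found below the root.
books :: X.Document -> [Book]
books doc = map toBook (fromDocument doc $// element (qname pgtermsNs "ebook"))
  where
    toBook c = Book
      { bookAbout    = attribute (qname rdfNs "about") c
      , bookTitles   = c $/ element (qname dctermsNs "title") &/ content
      , bookAuthors  = c $/ element (qname dctermsNs "creator")
                         &/ element (qname pgtermsNs "agent")
                         &/ element (qname pgtermsNs "name")
                         &/ content
      , bookSubjects = c $/ element (qname dctermsNs "subject")
                         &/ element (qname rdfNs "Description")
                         &/ element (qname rdfNs "value")
                         &/ content
      }

main :: IO ()
main = do
  args <- getArgs
  case args of
    -- The file path is supplied by the caller; X.readFile throws on malformed XML.
    [path] -> X.readFile X.def path >>= mapM_ print . books
    _      -> putStrLn "usage: pass the path to one .rdf file"
```

Every field is a list because the sample repeats elements such as `dcterms:subject`, and the subject values mix classification codes with subject headings, so further filtering on `dcam:memberOf` would likely be needed in practice.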

0 Answers