
This is a picture of my OpenRefine project. I need to extract all the skos:closeMatch URIs from an RDF/XML column into a separate column in OpenRefine.

This is my RDF/XML code:

<rdf:RDF xmlns:skos="http://www.w3.org/2004/02/skos/core#" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:rdfs="http://www.w3.org/1999/02/22-rdf-schema#" xmlns:cs="http://purl.org/vocab/changeset/schema#" xmlns:skosxl="http://www.w3.org/2008/05/skos-xl#">
  <rdf:Description rdf:about="http://id.loc.gov/authorities/subjects/sh85145648">
    <rdf:type rdf:resource="http://www.w3.org/2004/02/skos/core#Concept"/>
    <skos:prefLabel xml:lang="en">Water-supply</skos:prefLabel>
    <skosxl:altLabel>
      <rdf:Description>
        <rdf:type rdf:resource="http://www.w3.org/2008/05/skos-xl#Label"/>
        <skosxl:literalForm xml:lang="en">Availability, Water</skosxl:literalForm>
      </rdf:Description>
    </skosxl:altLabel>
    <skosxl:altLabel>
      <rdf:Description>
        <rdf:type rdf:resource="http://www.w3.org/2008/05/skos-xl#Label"/>
        <skosxl:literalForm xml:lang="en">Water availability</skosxl:literalForm>
      </rdf:Description>
    </skosxl:altLabel>
    <skosxl:altLabel>
      <rdf:Description>
        <rdf:type rdf:resource="http://www.w3.org/2008/05/skos-xl#Label"/>
        <skosxl:literalForm xml:lang="en">Water resources</skosxl:literalForm>
      </rdf:Description>
    </skosxl:altLabel>
    <skos:closeMatch rdf:resource="http://www.yso.fi/onto/yso/p9967"/>
    <skos:closeMatch rdf:resource="http://id.worldcat.org/fast/1172350"/>
    <skos:closeMatch rdf:resource="http://www.wikidata.org/entity/Q1061108"/>
    <skos:closeMatch rdf:resource="http://id.worldcat.org/fast/1172350"/>
    <skos:closeMatch rdf:resource="http://www.wikidata.org/entity/Q1061108"/>
    <skos:closeMatch rdf:resource="http://www.yso.fi/onto/yso/p9967"/>
    <skos:changeNote>
      <cs:ChangeSet>
        <cs:subjectOfChange rdf:resource="http://id.loc.gov/authorities/subjects/sh85145648"/>
        <cs:creatorName rdf:resource="http://id.loc.gov/vocabulary/organizations/dlc"/>
        <cs:createdDate rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">1986-02-11T00:00:00</cs:createdDate>
        <cs:changeReason rdf:datatype="http://www.w3.org/2001/XMLSchema#string">new</cs:changeReason>
      </cs:ChangeSet>
    </skos:changeNote>
    <skos:changeNote>
      <cs:ChangeSet>
        <cs:subjectOfChange rdf:resource="http://id.loc.gov/authorities/subjects/sh85145648"/>
        <cs:creatorName rdf:resource="http://id.loc.gov/vocabulary/organizations/dlc"/>
        <cs:createdDate rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">2016-11-17T07:36:37</cs:createdDate>
        <cs:changeReason rdf:datatype="http://www.w3.org/2001/XMLSchema#string">revised</cs:changeReason>
      </cs:ChangeSet>
    </skos:changeNote>
  </rdf:Description>
</rdf:RDF>

I tried the expression value.parseHtml().select('skos|closematch') to add a column based on the RDF/XML column, but it doesn't work.

Tom Morris
Zayn Sab

2 Answers


Your code is pretty close. Were you examining the display of the preview column to help guide you?

Your code returns an array of six XML elements. The things that you're missing are:

  • an iterator - forEach()
  • a function to fetch the value of the attribute - htmlAttr()
  • something to convert the array to a single value which can be stored in the column - join()

Altogether it'll look like: forEach(value.parseHtml().select('skos|closeMatch'), element, element.htmlAttr('rdf:resource')).join('|')

I actually built this from the inside out. I started with a single element, value.parseHtml().select('skos|closeMatch')[0], to see what it looked like, then added .htmlAttr('rdf:resource'), and finally wrapped the entire thing with forEach(...).join('|'). (Obviously, you can choose whatever delimiter you find most useful.)

Update: your data has duplicates, so you might want to add .uniques() like:

forEach(value.parseHtml().select('skos|closeMatch'), element, element.htmlAttr('rdf:resource')).uniques().join('|')
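
If you want to sanity-check the result outside OpenRefine, the same three steps (select the elements, read the attribute, join the unique values) can be sketched in standalone Python 3. This is only a rough equivalent under a few assumptions: it uses the standard library's xml.etree.ElementTree rather than the HTML parser behind parseHtml(), the names close_match_uris, sample, and delimiter are made up for illustration, and keeping first-seen order when dropping duplicates is my choice here, not something I have verified about uniques().

```python
import xml.etree.ElementTree as ET

# Namespace URIs copied from the xmlns declarations in the question's RDF/XML.
SKOS = "http://www.w3.org/2004/02/skos/core#"
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"


def close_match_uris(rdf_xml, delimiter="|"):
    """Return the unique skos:closeMatch URIs in rdf_xml, joined by delimiter.

    Hypothetical helper for illustration; not part of OpenRefine.
    """
    root = ET.fromstring(rdf_xml)
    # iter() walks the whole tree, playing the role of select() in GREL.
    uris = [
        element.get("{%s}resource" % RDF)
        for element in root.iter("{%s}closeMatch" % SKOS)
    ]
    # dict.fromkeys drops duplicates while keeping first-seen order.
    unique = [uri for uri in dict.fromkeys(uris) if uri is not None]
    return delimiter.join(unique)


# Trimmed-down version of the record from the question, duplicate included.
sample = """<rdf:RDF xmlns:skos="http://www.w3.org/2004/02/skos/core#"
         xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <rdf:Description rdf:about="http://id.loc.gov/authorities/subjects/sh85145648">
    <skos:closeMatch rdf:resource="http://www.yso.fi/onto/yso/p9967"/>
    <skos:closeMatch rdf:resource="http://id.worldcat.org/fast/1172350"/>
    <skos:closeMatch rdf:resource="http://www.wikidata.org/entity/Q1061108"/>
    <skos:closeMatch rdf:resource="http://id.worldcat.org/fast/1172350"/>
  </rdf:Description>
</rdf:RDF>"""

print(close_match_uris(sample))
```

Unlike the GREL selector, ElementTree is namespace-aware, so both the element and the attribute are addressed by their full namespace URI instead of the skos and rdf prefixes.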

Tom Morris
  • Very nice answer! Small question: how do you know that for `skos:closeMatch` the colon must be substituted with a `|`, but `rdf:resource` can pass as is? I've been looking for that, but couldn't find it. – RolfBly Oct 18 '22 at 10:37
  • I'm not sure if attribute names support namespacing. The literal string was the first thing I tried, and it worked, so I didn't investigate alternatives. – Tom Morris Oct 18 '22 at 17:22

What is your desired result? I copied your code into OpenRefine's Clipboard and selected rdf:Description as the first XML element. I assume the code in your question is just a short sample and that you in fact have several rdf:Description elements inside the rdf:RDF element (i.e. ), so you get a record for each rdf:Description.

This is what I get in the Configure parsing options pane...

screenshot1

And this is what I get when I do Create Project and switch to row mode.

Screenshot2

Is the third column what you mean by this?

all the skos:closeMatch URIs from an RDF/XML column into a separate column in OpenRefine.

If not, please clarify by editing your question.

RolfBly
  • Thanks for answering my question, but this RDF/XML is not my whole project. Actually, I have a column in my OpenRefine project that contains the RDF/XML code (I inserted the image in the question), and I need to extract the skos:closeMatch instances into another column in this project. The third column you got is what I want, but not in a new project. – Zayn Sab Oct 16 '22 at 06:15