How can you parse xml in Google Refine using jython/python ElementTree

Question

I'm trying to parse some XML in Google Refine using Jython and ElementTree, but I'm struggling to find any documentation to help me get this working (probably not helped by my not being a Python coder).

Here's an extract of the XML I'm trying to parse. I want to return a joined string of all the dc:identifier values:

<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd">
  <dc:creator>J. Koenig</dc:creator>
  <dc:date>2010-01-13T15:47:38Z</dc:date>
  <dc:date>2010-01-13T15:47:38Z</dc:date>
  <dc:date>2010-01-13T15:47:38Z</dc:date>
  <dc:identifier>CCTL0059</dc:identifier>
  <dc:identifier>CCTL0059</dc:identifier>
  <dc:identifier>http://open.jorum.ac.uk:80/xmlui/handle/123456789/335</dc:identifier>
  <dc:format>application/pdf</dc:format>
</oai_dc:dc>

Here's the code I've got so far. It's only a test to get anything back at all, because right now all I get is 'Error: null':

from elementtree import ElementTree as ET
element = ET.parse(value)

namespace = "{http://www.openarchives.org/OAI/2.0/oai_dc/}"
e = element.findall('{0}identifier'.format(namespace))
for i in e:
   count += 1
return count

score 2 · Answer 1 · answered Dec 15 '11 at 00:53

You've used the wrong namespace. This works on Jython 2.5.1:

from xml.etree import ElementTree as ET
element = ET.fromstring(value) # `value` is a string with the xml from question

namespace = "{http://purl.org/dc/elements/1.1/}"
for e in element.getiterator(namespace+'identifier'):
    print e.text

Output

CCTL0059
CCTL0059
http://open.jorum.ac.uk:80/xmlui/handle/123456789/335
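To see why the namespace matters, it can help to print the tags as ElementTree stores them. The following is a minimal sketch, assuming a trimmed copy of the record from the question (the SAMPLE string is an illustration, not the full record). Each tag comes back as "{namespace-uri}localname", so dc:identifier lives under the purl.org URI, while the openarchives.org URI only applies to the root oai_dc:dc element.

```python
from xml.etree import ElementTree as ET

# Trimmed copy of the record from the question (illustrative only).
SAMPLE = (
    '<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" '
    'xmlns:dc="http://purl.org/dc/elements/1.1/">'
    '<dc:creator>J. Koenig</dc:creator>'
    '<dc:identifier>CCTL0059</dc:identifier>'
    '</oai_dc:dc>'
)

root = ET.fromstring(SAMPLE)

# ElementTree expands each prefix into its full namespace URI.
tags = [root.tag] + [child.tag for child in root]
for tag in tags:
    print(tag)
```

Note also that fromstring() takes the XML text itself, whereas parse() expects a file name or file object, which is why the code above switches to fromstring() for a cell value.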

score 2 · Accepted Answer · edited Dec 15 '11 at 21:00

You can use a GREL expression like this, try it:

forEach(value.parseHtml().select("dc|identifier"),v,v.htmlText()).join(",")

This says: for each identifier found, give me the htmlText and join them all with commas. parseHtml() uses the Jsoup library (jsoup.org) and really just parses tags and structure. It also understands namespaced tags written in the form ns|identifier, which is a nice way to get what you're after in this case.

edited Dec 15 '11 at 21:00 by musefan
answered Dec 15 '11 at 20:29 by Thad Guidry

For some reason I couldn't get @j-f-sebastian or Tom's variation to work in Google Refine (might be an issue with my install of Refine?), but the GREL solution works for me – mhawksey Dec 20 '11 at 16:34

score 0 · Answer 3 · answered Dec 15 '11 at 17:35

Here's a slight tweak on J.F. Sebastian's version which can be pasted directly into Google Refine:

from xml.etree import ElementTree as ET
element = ET.fromstring(value)
namespace = "{http://purl.org/dc/elements/1.1/}"
return ','.join([e.text for e in element.getiterator(namespace+'identifier')])

It returns a comma-separated list, but you can change the delimiter by editing the string in the return statement.
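If you want the delimiter to be a parameter, the same idea can be wrapped in a function. This is only a sketch: join_identifiers and DC_NS are hypothetical names, the SAMPLE record is a trimmed illustration, and findall() with a ".//" path is used in place of getiterator(), on the assumption that your Jython's ElementTree supports that path form. It also skips identifiers with no text, which would otherwise make join() fail.

```python
from xml.etree import ElementTree as ET

# Dublin Core namespace in ElementTree's "{uri}" notation.
DC_NS = "{http://purl.org/dc/elements/1.1/}"

def join_identifiers(xml_text, delimiter=","):
    """Return the text of every dc:identifier, joined by delimiter."""
    root = ET.fromstring(xml_text)
    # ".//" searches all descendants of the root element.
    found = root.findall(".//" + DC_NS + "identifier")
    # Skip empty elements, whose text is None.
    return delimiter.join([e.text for e in found if e.text])

# Trimmed record for illustration only.
SAMPLE = (
    '<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" '
    'xmlns:dc="http://purl.org/dc/elements/1.1/">'
    '<dc:identifier>CCTL0059</dc:identifier>'
    '<dc:identifier/>'
    '<dc:identifier>http://open.jorum.ac.uk:80/xmlui/handle/123456789/335</dc:identifier>'
    '</oai_dc:dc>'
)

print(join_identifiers(SAMPLE, " | "))
```

Inside a Refine expression you would presumably paste the function and then finish with something like return join_identifiers(value, " | ").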
