I'm trying to parse some XML in Google Refine using Jython and ElementTree, but I'm struggling to find any documentation to help me get this working (probably not helped by my not being a Python coder).

Here's an extract of the XML I'm trying to parse. I'm trying to return a joined string of all the dc:identifier values:

<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd">
  <dc:creator>J. Koenig</dc:creator>
  <dc:date>2010-01-13T15:47:38Z</dc:date>
  <dc:date>2010-01-13T15:47:38Z</dc:date>
  <dc:date>2010-01-13T15:47:38Z</dc:date>
  <dc:identifier>CCTL0059</dc:identifier>
  <dc:identifier>CCTL0059</dc:identifier>
  <dc:identifier>http://open.jorum.ac.uk:80/xmlui/handle/123456789/335</dc:identifier>
  <dc:format>application/pdf</dc:format>
</oai_dc:dc>

Here's the code I've got so far. This is just a test to return anything at all, as right now all I'm getting is 'Error: null':

from elementtree import ElementTree as ET
element = ET.parse(value)

namespace = "{http://www.openarchives.org/OAI/2.0/oai_dc/}"
e = element.findall('{0}identifier'.format(namespace))
for i in e:
   count += 1
return count
mhawksey

3 Answers

You've used the wrong namespace. This works on Jython 2.5.1:

from xml.etree import ElementTree as ET
element = ET.fromstring(value) # `value` is a string with the xml from question

namespace = "{http://purl.org/dc/elements/1.1/}"
for e in element.getiterator(namespace+'identifier'):
    print e.text

Output

CCTL0059
CCTL0059
http://open.jorum.ac.uk:80/xmlui/handle/123456789/335
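
To see why the namespace matters, here is a minimal sketch that prints the fully qualified tag names ElementTree assigns. It assumes standard CPython rather than Refine's Jython, and `SAMPLE_XML` is a hypothetical variable holding a trimmed copy of the XML from the question. The `oai_dc` URI applies only to the root element, while the child elements use the `dc` URI, so a search built on the `oai_dc` URI finds nothing.

```python
from xml.etree import ElementTree as ET

# Assumption: a trimmed copy of the XML from the question, same structure.
SAMPLE_XML = """<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:creator>J. Koenig</dc:creator>
  <dc:identifier>CCTL0059</dc:identifier>
  <dc:identifier>http://open.jorum.ac.uk:80/xmlui/handle/123456789/335</dc:identifier>
</oai_dc:dc>"""

# fromstring() parses XML held in a string;
# parse() expects a file name or file object instead.
root = ET.fromstring(SAMPLE_XML)

OAI_NS = "{http://www.openarchives.org/OAI/2.0/oai_dc/}"
DC_NS = "{http://purl.org/dc/elements/1.1/}"

# ElementTree stores each tag as "{namespace-uri}localname".
child_tags = [child.tag for child in root]

# Searching with the root's namespace matches no children...
wrong = root.findall(OAI_NS + "identifier")
# ...while the dc namespace matches both identifier elements.
right = root.findall(DC_NS + "identifier")

print(root.tag)
print(child_tags)
print(len(wrong), len(right))
```

Printing `child.tag` like this is a quick way to check which namespace prefix string a search needs.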
jfs

You can also use a GREL expression like this:

forEach(value.parseHtml().select("dc|identifier"),v,v.htmlText()).join(",")

For each identifier found, this returns its htmlText, then joins them all with commas. parseHtml() uses the Jsoup library (jsoup.org) and really just parses tags and structure. It also understands namespaced selectors in the form ns|identifier, which is a nice way to get what you're after in this case.

Thad Guidry
  • For some reason I couldn't get @j-f-sebastian or Tom's variation to work in Google Refine (might be an issue with my install of Refine?), but the GREL solution works for me – mhawksey Dec 20 '11 at 16:34

Here's a slight tweak on J.F. Sebastian's version which can be pasted directly into Google Refine:

from xml.etree import ElementTree as ET
element = ET.fromstring(value)
namespace = "{http://purl.org/dc/elements/1.1/}"
return ','.join([e.text for e in element.getiterator(namespace+'identifier')])

It returns a comma-separated list, but you can change the delimiter by replacing the ',' string in the return statement.
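
As a rough sketch of the delimiter point, the same logic can be wrapped in a function that takes the delimiter as a parameter, which makes it easy to try outside Refine. The names `join_identifiers` and `SAMPLE_XML` are hypothetical, and the sample is a trimmed copy of the question's XML. It uses `findall` with a `.//` path instead of `getiterator`, on the assumption that you may be testing on a recent CPython where `getiterator` is no longer available.

```python
from xml.etree import ElementTree as ET

DC_NS = "{http://purl.org/dc/elements/1.1/}"


def join_identifiers(xml_text, delimiter=","):
    """Return the text of every dc:identifier element, joined by delimiter."""
    root = ET.fromstring(xml_text)
    # './/' searches all descendants of the root, not just direct children.
    matches = root.findall(".//" + DC_NS + "identifier")
    # Treat an empty element (text is None) as an empty string.
    return delimiter.join([e.text or "" for e in matches])


# Assumption: a trimmed copy of the XML from the question.
SAMPLE_XML = """<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:identifier>CCTL0059</dc:identifier>
  <dc:identifier>CCTL0059</dc:identifier>
  <dc:identifier>http://open.jorum.ac.uk:80/xmlui/handle/123456789/335</dc:identifier>
</oai_dc:dc>"""

joined = join_identifiers(SAMPLE_XML, "|")
print(joined)
```

Inside Refine, the equivalent would presumably be the function definition followed by `return join_identifiers(value, "|")`, though I have only checked this sketch as a standalone script.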

Tom Morris