0

I am trying to parse an xml file, and i only need one attribute. Is there any easy way to get to said attribute?

The file looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE chunkList SYSTEM "ccl.dtd">
<chunkList>
 <chunk id="ch1" type="p">
  <sentence id="s1">
   <tok>
    <orth>testowy</orth>
    <lex disamb="1"><base>testowy</base><ctag>adj:sg:nom:m3:pos</ctag></lex>
    <prop key="sense:ukb:syns_id">1358</prop>
    <prop key="sense:ukb:syns_rank">1358/1.0000000000</prop>
    <prop key="sense:ukb:unitsstr">próbny.1(42:jak) testowy.1(42:jak)</prop>
   </tok>
   <tok>
    <orth>plik</orth>
    <lex disamb="1"><base>plik</base><ctag>subst:sg:nom:m3</ctag></lex>
    <prop key="sense:ukb:syns_id">35864</prop>
    <prop key="sense:ukb:syns_rank">35864/0.6075684112 2248/0.3924315888</prop>
    <prop key="sense:ukb:unitsstr">plik.2(7:por)</prop>
   </tok>
  </sentence>
 </chunk>
</chunkList>

And it will have variable number of <tok> branches, and each <tok> branch might have different number of keys. The only attribute that i need to extract is syns_id.

It will probably be one HUGE xml file, im thinking few hundred megabytes. Or about 100k of small ones, with just around 5-10 <tok>'s.

What i need, is a list containing all of these syns_id's. How should i approach this? I think regexes would solve it, but i have not used them yet. Or is there any faster/better way?

KONAKONA
  • 87
  • 5
  • There are never going to be any comments in the XML, because it is automatically generated. This is word sense disambiguator, which tags each word in sentence with correct meanings, and produces such XML's as output. – KONAKONA Nov 02 '17 at 23:52
  • 1
    XML isn't suitable for random access. That said, a few hundred megabytes should be tolerable with LXML + an xpath query to pull out the bits you care about. – o11c Nov 02 '17 at 23:53
  • A few hundred MB - is that an issue on your machine? What's the RAM limitation? – Thomas Weller Nov 02 '17 at 23:54
  • If you need lightweight XML parsing, look for a [SAX parser](https://en.wikipedia.org/wiki/Simple_API_for_XML) – Thomas Weller Nov 02 '17 at 23:55
  • RAM is not a limitation i hope - at the moment i got 4GB and a lot of swap. – KONAKONA Nov 02 '17 at 23:56
  • Then --- just implement is with any XML parser you know – Thomas Weller Nov 02 '17 at 23:56
  • My problem is that path to the syns_id is going to vary almost each time, and i can't find any easy way to access this attribute easily. For example ElementTree requires me to write specific path to attribute, unless i am in the wrong? – KONAKONA Nov 02 '17 at 23:58
  • @ThomasWeller Not *any* XML parser. A pure-python non-streaming parser would be a pain for this. Any C-based parser should be fine, and streaming ones may be fine. – o11c Nov 02 '17 at 23:59

1 Answers1

1

I don't know exactly how scalable this is, but this would be my first attempt in any case:

import lxml.etree

et = lxml.etree.parse('big.xml')
et.xpath('//prop[@key="sense:ukb:syns_id"]/text()')

On your sample, this produces:

['1358', '35864']

(though note that the strings are actually instances of lxml.etree._ElementUnicodeResult, which is a subclass of str)

o11c
  • 15,265
  • 4
  • 50
  • 75
  • and this is precisely what i was looking for, i was having troubles with variable path to the syns_id key, and it looks like it solves it neatly. Thank you! – KONAKONA Nov 02 '17 at 23:59
  • @KONAKONA Note that there *are* valid reasons to avoid `//` searches, but I don't think they apply here. If they *do* turn out to matter, you could write it out as `/chunkList/chunk/sentence/tok/`, which is the path to the all the `prop` elements in your sample. – o11c Nov 03 '17 at 00:02
  • For another question, after installing lxml with pip, it cannot find etree. Did you use some mind-shortcut? I tried both import lxml.etree and from lxml import etree. – KONAKONA Nov 03 '17 at 00:04
  • @KONAKONA Are you certain that you installed it using the correct version of pip? (For example, could you have installed it with a Python 2.7 version of pip when you're running with Python 3?) Try `import lxml`. – jpmc26 Nov 03 '17 at 01:41