I am trying to parse an XML file, and I only need one attribute. Is there an easy way to get at it?
The file looks like this:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE chunkList SYSTEM "ccl.dtd">
<chunkList>
<chunk id="ch1" type="p">
<sentence id="s1">
<tok>
<orth>testowy</orth>
<lex disamb="1"><base>testowy</base><ctag>adj:sg:nom:m3:pos</ctag></lex>
<prop key="sense:ukb:syns_id">1358</prop>
<prop key="sense:ukb:syns_rank">1358/1.0000000000</prop>
<prop key="sense:ukb:unitsstr">próbny.1(42:jak) testowy.1(42:jak)</prop>
</tok>
<tok>
<orth>plik</orth>
<lex disamb="1"><base>plik</base><ctag>subst:sg:nom:m3</ctag></lex>
<prop key="sense:ukb:syns_id">35864</prop>
<prop key="sense:ukb:syns_rank">35864/0.6075684112 2248/0.3924315888</prop>
<prop key="sense:ukb:unitsstr">plik.2(7:por)</prop>
</tok>
</sentence>
</chunk>
</chunkList>
The file will have a variable number of <tok> branches, and each <tok> branch might have a different number of keys.
The only value I need to extract is syns_id, i.e. the text of each <prop key="sense:ukb:syns_id"> element.
It will probably be one HUGE XML file, I'm thinking a few hundred megabytes, or about 100k small ones with only around 5-10 <tok> elements each.
What I need is a list containing all of these syns_id values. How should I approach this? I think regexes would solve it, but I have not used them yet. Is there a faster or better way?
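To make the goal concrete, here is a minimal sketch of one possible approach. It assumes Python, which the question does not actually state, and uses the standard library's streaming parser `xml.etree.ElementTree.iterparse` rather than regexes. The names `iter_syns_ids`, `SYNS_ID_KEY` and `SAMPLE` are made up for illustration, and `SAMPLE` is a shortened copy of the file above.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical constant: the key value taken from the sample file above.
SYNS_ID_KEY = "sense:ukb:syns_id"


def iter_syns_ids(source):
    """Yield the text of every <prop key="sense:ukb:syns_id"> in source.

    `source` may be a filename or a binary file object.
    """
    # iterparse reads the document incrementally instead of loading it whole.
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "prop" and elem.get("key") == SYNS_ID_KEY:
            yield elem.text
        elif elem.tag == "tok":
            # The <tok> is finished; drop its children to limit memory growth.
            elem.clear()


# Shortened stand-in for the real file (assumption: same structure).
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE chunkList SYSTEM "ccl.dtd">
<chunkList>
<chunk id="ch1" type="p">
<sentence id="s1">
<tok>
<orth>testowy</orth>
<prop key="sense:ukb:syns_id">1358</prop>
<prop key="sense:ukb:syns_rank">1358/1.0000000000</prop>
</tok>
<tok>
<orth>plik</orth>
<prop key="sense:ukb:syns_id">35864</prop>
</tok>
</sentence>
</chunk>
</chunkList>
"""

syns_ids = list(iter_syns_ids(io.BytesIO(SAMPLE)))
print(syns_ids)  # ['1358', '35864']
```

For the single huge file, a path could be passed in place of the `io.BytesIO` object. For the 100k small files, the same function could be called once per file and the results appended to one list.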