I've been trying to parse an XML file (JMdict_e.xml) for translation purposes. However, parsing the whole file returns what looks like an incomplete dataset.

Code:

import xml.etree.ElementTree as ET

tree2 = ET.parse("JMdict_e.xml")
root2 = tree2.getroot()

print([i.tag for i in root2[55711]])
print([i.text for i in root2[55711][4]])

returns the following entries:

Result:

['ent_seq', 'k_ele', 'r_ele', 'r_ele', 'sense']
["Godan verb with `ru' ending", 'intransitive verb', 'to become less capable', 'to grow dull', 'to become blunt', 'to weaken']

Conversely, when the single entry is extracted from the original XML database into its own file (new.xml) and parsed, the following is obtained:

Code:

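import xml.etree.ElementTree as ET  # cElementTree was removed in Python 3.9

tree = ET.parse("new.xml")
root = tree.getroot()
# Tags of the children of the second <entry>
print([i.tag for i in root[1]])
# Child texts of each <sense>; non-sense children print an empty list
for i in root[1]:
    print([j.text for j in i if i.tag == "sense"])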

Result:

['ent_seq', 'k_ele', 'r_ele', 'r_ele', 'sense', 'sense', 'sense', 'sense', 'sense']
##Truncated empty lists
['にぶい', 'adjective (keiyoushi)', 'dull (e.g. a knife)', 'blunt']
['のろい is usu. in kana', 'thickheaded', 'obtuse', 'stupid']
['にぶい', 'dull (sound, color, etc.)', 'dim (light)']
['slow', 'sluggish', 'inert', 'lethargic']
['のろい', 'indulgent (esp. to the opposite sex)', 'doting']

I've been picking apart the data for a while but have not been able to identify the cause. I suspect that another entry in the XML file may be overriding what is shown.

XML fragments

<JMdict>
<entry>
<ent_seq>1000000</ent_seq>
<r_ele>
<reb>ヽ</reb>
</r_ele>
<r_ele>
<reb>くりかえし</reb>
</r_ele>
<sense>
<pos>&n;</pos>
<gloss>repetition mark in katakana</gloss>
</sense>
</entry>
<entry>
<ent_seq>1582430</ent_seq>
<k_ele>
<keb>鈍い</keb>
<ke_pri>ichi1</ke_pri>
<ke_pri>news2</ke_pri>
<ke_pri>nf30</ke_pri>
</k_ele>
<r_ele>
<reb>にぶい</reb>
<re_pri>ichi1</re_pri>
<re_pri>news2</re_pri>
<re_pri>nf30</re_pri>
</r_ele>
<r_ele>
<reb>のろい</reb>
<re_pri>ichi1</re_pri>
</r_ele>
<sense>
<stagr>にぶい</stagr>
<pos>&adj-i;</pos>
<gloss>dull (e.g. a knife)</gloss>
<gloss>blunt</gloss>
</sense>
<sense>
<s_inf>のろい is usu. in kana</s_inf>
<gloss>thickheaded</gloss>
<gloss>obtuse</gloss>
<gloss>stupid</gloss>
</sense>
<sense>
<stagr>にぶい</stagr>
<gloss>dull (sound, color, etc.)</gloss>
<gloss>dim (light)</gloss>
</sense>
<sense>
<gloss>slow</gloss>
<gloss>sluggish</gloss>
<gloss>inert</gloss>
<gloss>lethargic</gloss>
</sense>
<sense>
<stagr>のろい</stagr>
<gloss>indulgent (esp. to the opposite sex)</gloss>
<gloss>doting</gloss>
</sense>
</entry>
</JMdict>

XML file in question

http://ftp.monash.edu.au/pub/nihongo/JMdict_e.gz

Issue clarification

Note that in the first result only one sense element is present, compared to the 5 separate sense elements in the second result, even though the entry should be identical (both XML files contain the entry as is). Additionally, the first result is incomplete. Most importantly, the parser misses the 'stagr' elements. If the parse were fully functional, both results would be expected to be identical.

Kenneth Lim

1 Answer

I think what is happening here is that you are looking at two different nodes. If you look at the ent_seq of the entry in the first code sample, you will see that it is 1582510. I searched for it in the original file and also dumped the XML object using ET.dump; the object you are actually analyzing is this:

<entry>
<ent_seq>1582510</ent_seq>
<k_ele>
<keb>内股膏薬</keb>
</k_ele>
<r_ele>
<reb>うちまたこうやく</reb>
</r_ele>
<r_ele>
<reb>うちまたごうやく</reb>
</r_ele>
<sense>
<pos>noun (common) (futsuumeishi)</pos>
<misc>yojijukugo</misc>
<gloss xml:lang="eng">double-dealer</gloss>
<gloss xml:lang="eng">fence-sitter</gloss>
<gloss xml:lang="eng">timeserver</gloss>
<gloss xml:lang="eng">moving back and forth between two sides in a conflict</gloss>
<gloss xml:lang="eng">duplicity</gloss>
<gloss xml:lang="eng">turncoat</gloss>
</sense>
</entry>

This corresponds to the output you get from the first code sample. The second code sample is actually analyzing the entry with ent_seq 1582430, which is a completely different entry.
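
To avoid this kind of mix-up, one option is to look entries up by their ent_seq value instead of by position in the root. The following is only a minimal sketch: the helper name find_entry and the inline SAMPLE string are made up for illustration, and SAMPLE is a trimmed stand-in for the real file rather than its actual contents.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed stand-in for JMdict_e.xml (the real file is far larger).
SAMPLE = """<JMdict>
<entry><ent_seq>1000000</ent_seq>
<sense><gloss>repetition mark in katakana</gloss></sense>
</entry>
<entry><ent_seq>1582430</ent_seq>
<sense><stagr>にぶい</stagr><gloss>dull (e.g. a knife)</gloss><gloss>blunt</gloss></sense>
<sense><gloss>slow</gloss><gloss>sluggish</gloss></sense>
</entry>
</JMdict>"""


def find_entry(root, ent_seq):
    """Return the <entry> whose <ent_seq> text equals ent_seq, or None."""
    for entry in root.iter("entry"):
        if entry.findtext("ent_seq") == str(ent_seq):
            return entry
    return None


root = ET.fromstring(SAMPLE)
entry = find_entry(root, 1582430)

# Collect the child texts of every <sense> in the matched entry.
senses = [[child.text for child in sense] for sense in entry.findall("sense")]
print(senses)
```

With the real file you would presumably build root from ET.parse("JMdict_e.xml").getroot() instead, and calling ET.dump(entry) on the match is a quick way to confirm which entry you are actually holding.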

Santiago Alessandri
  • Ah, I missed two very similar words with similar meanings. Much thanks for the help, this will be a reminder that the numerical tags exist for a reason. – Kenneth Lim Feb 14 '15 at 07:47