I've been trying to parse an XML file (JMdict_e.xml) for translation purposes. However, parsing the whole file returns an incomplete dataset.
Code:
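import xml.etree.ElementTree as ET
# Parse the full dictionary file
tree2 = ET.ElementTree(file="JMdict_e.xml")
root2 = tree2.getroot()
# Tags of the children of the entry at index 55711
print([i.tag for i in root2[55711]])
# Text of each child of that entry's fifth child (the sense element)
print([i.text for i in root2[55711][4]])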
This returns the following:
Result:
['ent_seq', 'k_ele', 'r_ele', 'r_ele', 'sense']
["Godan verb with `ru' ending", 'intransitive verb', 'to become less capable', 'to grow dull', 'to become blunt', 'to weaken']
In contrast, when the same entry is extracted from the original XML file into its own file (new.xml) and parsed, the following is obtained:
Code:
# xml.etree.cElementTree was removed in Python 3.9; ElementTree is the drop-in replacement
import xml.etree.ElementTree as ET
tree = ET.ElementTree(file = "new.xml")
root = tree.getroot()
print([i.tag for i in root[1]])
for i in root[1]:
    # Prints the child texts for sense elements and an empty list for everything else
    print([j.text for j in i if i.tag == "sense"])
Result:
['ent_seq', 'k_ele', 'r_ele', 'r_ele', 'sense', 'sense', 'sense', 'sense', 'sense']
## Empty lists omitted
['にぶい', 'adjective (keiyoushi)', 'dull (e.g. a knife)', 'blunt']
['のろい is usu. in kana', 'thickheaded', 'obtuse', 'stupid']
['にぶい', 'dull (sound, color, etc.)', 'dim (light)']
['slow', 'sluggish', 'inert', 'lethargic']
['のろい', 'indulgent (esp. to the opposite sex)', 'doting']
I've been picking apart the data for a while but have not been able to identify the cause. I suspect that another entry in the XML file may be overriding what is shown.
XML fragments
<JMdict>
<entry>
<ent_seq>1000000</ent_seq>
<r_ele>
<reb>ヽ</reb>
</r_ele>
<r_ele>
<reb>くりかえし</reb>
</r_ele>
<sense>
<pos>&n;</pos>
<gloss>repetition mark in katakana</gloss>
</sense>
</entry>
<entry>
<ent_seq>1582430</ent_seq>
<k_ele>
<keb>鈍い</keb>
<ke_pri>ichi1</ke_pri>
<ke_pri>news2</ke_pri>
<ke_pri>nf30</ke_pri>
</k_ele>
<r_ele>
<reb>にぶい</reb>
<re_pri>ichi1</re_pri>
<re_pri>news2</re_pri>
<re_pri>nf30</re_pri>
</r_ele>
<r_ele>
<reb>のろい</reb>
<re_pri>ichi1</re_pri>
</r_ele>
<sense>
<stagr>にぶい</stagr>
<pos>&adj-i;</pos>
<gloss>dull (e.g. a knife)</gloss>
<gloss>blunt</gloss>
</sense>
<sense>
<s_inf>のろい is usu. in kana</s_inf>
<gloss>thickheaded</gloss>
<gloss>obtuse</gloss>
<gloss>stupid</gloss>
</sense>
<sense>
<stagr>にぶい</stagr>
<gloss>dull (sound, color, etc.)</gloss>
<gloss>dim (light)</gloss>
</sense>
<sense>
<gloss>slow</gloss>
<gloss>sluggish</gloss>
<gloss>inert</gloss>
<gloss>lethargic</gloss>
</sense>
<sense>
<stagr>のろい</stagr>
<gloss>indulgent (esp. to the opposite sex)</gloss>
<gloss>doting</gloss>
</sense>
</entry>
</JMdict>
XML file in question
http://ftp.monash.edu.au/pub/nihongo/JMdict_e.gz
Issue clarification
Note that in the first result only one sense element is present, compared to the five separate sense elements in the second result, even though both XML files contain the entry exactly as shown above. The content of the first result is also incomplete. Most importantly, the parser misses the 'stagr' elements. If the parse were fully functional, both results would be expected to be identical.
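One check that might help narrow this down is to look the entry up by its ent_seq value rather than by position, to confirm that index 55711 in the full file really refers to the same entry as the one in new.xml. The following is only a minimal sketch under some assumptions: find_entry and sense_texts are hypothetical helper names, and SAMPLE is a reduced copy of the fragments above with the &n; and &adj-i; entity references written out as plain text, because the sample carries no DTD.

```python
import xml.etree.ElementTree as ET

# Reduced sample modelled on the fragments above. The entity references
# (&n;, &adj-i;) are replaced by plain text since this sample has no DTD.
SAMPLE = """<JMdict>
<entry>
<ent_seq>1000000</ent_seq>
<r_ele><reb>ヽ</reb></r_ele>
<sense><pos>n</pos><gloss>repetition mark in katakana</gloss></sense>
</entry>
<entry>
<ent_seq>1582430</ent_seq>
<k_ele><keb>鈍い</keb></k_ele>
<r_ele><reb>にぶい</reb></r_ele>
<r_ele><reb>のろい</reb></r_ele>
<sense><stagr>にぶい</stagr><pos>adj-i</pos><gloss>dull (e.g. a knife)</gloss><gloss>blunt</gloss></sense>
<sense><s_inf>のろい is usu. in kana</s_inf><gloss>thickheaded</gloss></sense>
</entry>
</JMdict>"""


def find_entry(root, ent_seq):
    """Return the <entry> whose <ent_seq> text equals ent_seq, or None."""
    for entry in root.iter("entry"):
        if entry.findtext("ent_seq") == ent_seq:
            return entry
    return None


def sense_texts(entry):
    """Return one list of child texts per <sense> element of an entry."""
    return [[child.text for child in sense] for sense in entry.findall("sense")]


root = ET.fromstring(SAMPLE)
entry = find_entry(root, "1582430")
print(entry.findtext("k_ele/keb"))
for texts in sense_texts(entry):
    print(texts)
```

For the full file, the same helpers could be applied to the root returned by ET.ElementTree(file="JMdict_e.xml").getroot(). If the entry found this way has all five sense elements, the difference would come from the positional index pointing at another entry rather than from the parser.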