I have an XML/RDF file with this record:
<lemon:LexicalEntry rdf:about="ita-tachimetro-n">
What I want is to extract the object of the triple and store it in a dictionary, where the key is the word (in the example: tachimetro) and the value is the POS (part of speech): in the example, "n" for noun.
So this is what I've done:
from lxml import etree
import re

parser = etree.XMLParser(encoding="utf-8")
# ita-<word>-<pos>, where pos is one of a, n, r, v
regex = re.compile(r'^ita-(?P<word>[A-Za-z+]+)-(?P<pos>[anrv]{1})$')
doc = etree.parse('wn-ita-lemon.xml', parser=parser)
italian_vocabolary = {}
for df in doc.xpath('//lemon:LexicalEntry', namespaces={'lemon': 'http://lemon-model.net/lemon#'}):
    for k, v in df.attrib.items():
        rx = re.search(regex, v)
        if rx is not None:
            italian_vocabolary[rx.group('word')] = rx.group('pos')
        else:
            print(v)  # to check the value
Now, the strings are basically of two kinds: single words like the example above, and expressions like ita-Locusta+migratoria-n (for those I put a + in the regex character class).
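As a quick check that does not need the XML file, the pattern can be run against the two sample identifiers quoted in this question. This is only a minimal sketch: the `samples` list and the `parsed` dictionary are names made up for the illustration.

```python
import re

# Same pattern as in the script above.
regex = re.compile(r'^ita-(?P<word>[A-Za-z+]+)-(?P<pos>[anrv]{1})$')

# The two kinds of identifier mentioned in the question.
samples = ['ita-tachimetro-n', 'ita-Locusta+migratoria-n']

parsed = {}
for value in samples:
    m = regex.search(value)
    if m is not None:
        parsed[m.group('word')] = m.group('pos')

print(parsed)
```

Both ASCII-only identifiers match, so the pattern itself handles single words and "+"-joined expressions.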
However, there are some words that the regex doesn't match: accented words such as ita-sentenziosit%C3%A0-n, which should be ita-sentenziosità-n.
The XML file didn't have an XML declaration, so I added one afterwards:
<?xml version="1.0" encoding="UTF-8"?>
But it didn't work anyway, even after passing the correct encoding to the etree parser.
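For reference, here is a minimal sketch that reproduces the mismatch without the XML file. It assumes that `%C3%A0` is URL percent-encoding of the UTF-8 bytes for "à", which would mean the characters `%`, `C`, `3`, `A`, `0` are literally in the attribute value and the parser encoding has no effect on them. The `RELAXED` pattern and the `parse_entry` helper are hypothetical names introduced for this illustration, and using `\w` to accept accented letters is an assumption about what the identifiers may contain (it also accepts digits and underscores).

```python
import re
from urllib.parse import unquote

# Pattern from the question: ASCII letters and "+" only.
ORIGINAL = re.compile(r'^ita-(?P<word>[A-Za-z+]+)-(?P<pos>[anrv]{1})$')

# Hypothetical relaxed pattern: \w matches Unicode letters on str input.
RELAXED = re.compile(r'^ita-(?P<word>[\w+]+)-(?P<pos>[anrv])$')


def parse_entry(value):
    """Percent-decode the identifier, then split it into (word, pos)."""
    decoded = unquote(value)  # '%C3%A0' is decoded as UTF-8 by default
    match = RELAXED.match(decoded)
    if match is None:
        return None
    return match.group('word'), match.group('pos')


raw = 'ita-sentenziosit%C3%A0-n'
print(ORIGINAL.search(raw))   # no match: '%' and digits are not in the class
print(unquote(raw))           # the decoded identifier
print(parse_entry(raw))       # (word, pos) after decoding
```

Note that `unquote` leaves "+" untouched (unlike `unquote_plus`), so expressions like ita-Locusta+migratoria-n keep their separator.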