Accented characters in XML file: handling with Python element.etree

Question

I have an XML/RDF file with this record:

<lemon:LexicalEntry rdf:about="ita-tachimetro-n">

What I want is to extract this identifier and store it in a dictionary, where the key is the word (in the example: tachimetro) and the value is the POS (part of speech): in the example, "n" for noun.

This is what I've done so far:

from lxml import etree
import re

parser = etree.XMLParser(encoding="utf-8")
# '+' is in the character class because multi-word entries use it as a separator
regex = re.compile(r'^ita-(?P<word>[A-Za-z+]+)-(?P<pos>[anrv]{1})$')
doc = etree.parse('wn-ita-lemon.xml', parser=parser)
italian_vocabulary = {}
for df in doc.xpath('//lemon:LexicalEntry', namespaces={'lemon': 'http://lemon-model.net/lemon#'}):
    for k, v in df.attrib.items():
        rx = regex.search(v)
        if rx is not None:
            # same dictionary name as declared above (the original mixed two names)
            italian_vocabulary[rx.group('word')] = rx.group('pos')
        else:
            print(v)  # to check the values that did not match

The strings are basically of two kinds: single words like the example above, and expressions like ita-Locusta+migratoria-n (that is why I put a + in the regex character class).

There are some words that the regex doesn't retrieve, and they are the accented ones, like ita-sentenziosit%C3%A0-n, which should be ita-sentenziosità-n.

The XML file didn't have an XML declaration, so I added one later:

<?xml version="1.0" encoding="UTF-8"?>

but it didn't work anyway, even after giving the correct encoding to the etree parser.

[A-Za-z+] will match capital or lowercase A-Z characters, or the plus sign. It won't match accent characters. Probably not what you want. See this post: https://stackoverflow.com/questions/20690499/concrete-javascript-regex-for-accented-characters-diacritics — Alex von Brandenfels, Nov 18 '17 at 00:21
I changed the regex to ^ita-(?P<word>[A-Za-z+\']+[àèìòùéó]?)-(?P<pos>[anrv]{1})$ but it still doesn't work. I think it's an encoding problem in the XML file. Thanks anyway — Giacomo Ciampoli, Nov 18 '17 at 00:39
The problem might be that there are multiple ways of encoding accented letters. You could have [the unicode character for é](http://www.fileformat.info/info/unicode/char/e9/index.htm), or the [normal e character](http://www.fileformat.info/info/unicode/char/0065/index.htm) followed by the [unicode combining character for an acute accent](http://www.fileformat.info/info/unicode/char/0301/index.htm). So your regex will have to account for both of those — Alex von Brandenfels, Nov 18 '17 at 00:43
Well, %C3%A0 is the percent-encoded UTF-8 byte sequence for à: http://www.i18nqa.com/debug/utf8-debug.html Maybe I can try replacing each sequence with the correct accented character, since Italian only needs a limited number of them. Do you think this would work? — Giacomo Ciampoli, Nov 18 '17 at 00:59
Hmm, if it's already encoded as %C3 %A0 in your file, I don't think you need to replace anything. I'm not sure what the problem is. — Alex von Brandenfels, Nov 18 '17 at 01:01
Maybe try converting to [unicode normalization form C](http://unicode.org/reports/tr15/#Norm_Forms)? Not sure if that will help, or how hard it would be — Alex von Brandenfels, Nov 18 '17 at 01:03
It's strange: in the XML file I have ita-silenziosità-n, but when the Python parser reads it, it becomes ita-silenziosit%C3%A0-n, so the regex doesn't catch it — Giacomo Ciampoli, Nov 18 '17 at 01:03
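
Pulling the comments together, one thing that might be worth trying: if the attribute values really do arrive as ita-sentenziosit%C3%A0-n, that looks like URL percent-encoding of UTF-8 bytes, which the standard library can decode with urllib.parse.unquote. The sketch below is only an illustration under that assumption, not a confirmed fix. The names ENTRY_RE and parse_entry are made up for this example. It decodes the value, normalizes it to NFC (as suggested above, in case an accent is stored as a combining character), and matches with a Unicode-aware letter class instead of [A-Za-z].

```python
import re
import unicodedata
from urllib.parse import unquote

# Hypothetical pattern: [^\W\d_] matches any Unicode letter (à, è, ... included);
# '+' joins multi-word expressions and "'" allows apostrophes inside a word.
ENTRY_RE = re.compile(r"^ita-(?P<word>(?:[^\W\d_]|[+'])+)-(?P<pos>[anrv])$")


def parse_entry(value):
    """Return (word, pos) for an rdf:about value, or None if it does not match.

    Assumes accented letters may be percent-encoded UTF-8 (e.g. '%C3%A0').
    """
    decoded = unquote(value)  # percent-decoding, UTF-8 by default; '+' is left alone
    normalized = unicodedata.normalize("NFC", decoded)  # compose 'a' + U+0300 into 'à'
    match = ENTRY_RE.match(normalized)
    if match is None:
        return None
    return match.group("word"), match.group("pos")
```

In the loop from the question, parse_entry(v) would stand in for the re.search call, with the returned pair stored in the dictionary when it is not None. Note that unquote is used rather than unquote_plus, so the + in entries like ita-Locusta+migratoria-n is not turned into a space.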
