How to tokenize a sentence in the XML and create new child nodes?

Asked May 08 '22 at 09:39

Active May 08 '22 at 14:26

Viewed 118 times

I have XML which looks like this:

<para id="0">
    <se lang="hi">काकेशिया में तब लड़ाई</se>
    <se lang="ru">потом боевые действия на Кавказе</se>
</para>
<para id="1">
...
</para>
<para id="2">
...
</para>

I want to tokenize the Devanagari text using the iNLTK library and get a file that looks like this:

<para id="0">
    <se lang="hi">
        <w>काकेशिया</w> 
        <w>में</w> 
        <w>तब</w>
        <w>लड़ाई</w> 
    </se>
    <se lang="ru">потом боевые действия на Кавказе</se>
</para>
<para id="1">
...
</para>
<para id="2">
...
</para>

I understand how to tokenize the sentence:

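from inltk.inltk import tokenize

# body is a DOM element (e.g. from xml.dom.minidom) that contains the <para> nodes
paras = body.getElementsByTagName('para')
for para in paras:
    # In the sample above the Hindi sentence is the first <se>, so use index 0
    devanagari = para.getElementsByTagName('se')[0].childNodes[0].nodeValue
    print(tokenize(devanagari, 'hi'))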

What I don't know is how to create a <w>...</w> child node for each word and write the result back into the XML.

How can I do that using xml.etree.ElementTree?
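
One possible approach, as a minimal sketch rather than a definitive answer: clear the text of the Hindi <se> element and append one <w> child per token with ET.SubElement. Several things here are assumptions and not part of the original post: the <body> wrapper element (added because the sample has several top-level <para> elements), the helper names tokenize_words and wrap_words, and the whitespace split that stands in for iNLTK's tokenize(text, 'hi') so the sketch runs without the library installed.

```python
import xml.etree.ElementTree as ET


def tokenize_words(text):
    # Hypothetical stand-in for iNLTK's tokenize(text, 'hi'):
    # a plain whitespace split, used so the sketch has no dependencies.
    return text.split()


def wrap_words(se, tokenizer=tokenize_words):
    """Replace the text of an <se> element with one <w> child per token."""
    tokens = tokenizer(se.text or "")
    se.text = None                  # drop the original sentence text
    for token in tokens:
        w = ET.SubElement(se, "w")  # append <w> as the last child of <se>
        w.text = token


# Assumed wrapper root <body>; the original file's root element is not shown.
xml_source = """<body>
<para id="0">
    <se lang="hi">काकेशिया में तब लड़ाई</se>
    <se lang="ru">потом боевые действия на Кавказе</se>
</para>
</body>"""

root = ET.fromstring(xml_source)
for se in root.iter("se"):
    # Select by the lang attribute instead of relying on element order.
    if se.get("lang") == "hi":
        wrap_words(se)

result = ET.tostring(root, encoding="unicode")
print(result)
```

To use iNLTK itself, a function such as lambda text: tokenize(text, 'hi') could be passed as the tokenizer argument. iNLTK's tokenizer may return subword pieces rather than whole words, so its output might need cleaning before it is wrapped in <w> elements. To save the result, ET.ElementTree(root).write(path, encoding="utf-8", xml_declaration=True) writes the tree to a file, and ET.indent(root) (Python 3.9+) can be called first to get indented output.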

edited May 08 '22 at 14:26

Varvara Smirnova

Have you tried anything? What is the problem? – mzjn May 08 '22 at 09:47
Please clarify your specific problem or provide additional details to highlight exactly what you need. As it's currently written, it's hard to tell exactly what you're asking. – Community May 08 '22 at 13:30
XSLT 2.0 and later: https://stackoverflow.com/questions/11487704/how-to-make-xsl-tokenize-work – Yitzhak Khabinsky May 08 '22 at 14:36

0 Answers