
Using Python 2, I am reading strings from a variable (the text of an XML tag) and storing them in a list.

First: the strings contain special characters, and when I print them they don't show up correctly even though I am using encode("ISO-8859-1").

Second: each string shows up in its own list, and I want them all in the same list.

import lxml.objectify
from lxml import etree
import codecs
import xml.etree.cElementTree as ET
file_path = "C:\Users\HP\Downloads\Morphalou-2.0.xml"
for event, elem in ET.iterparse(file_path, events=("start", "end")):
    if elem.tag == 'orthography' and event =='start':
        data = elem.text
        my_list = []
        if data is not None :
            for i in data.split('\n'):
                my_list.append(i.encode("ISO-8859-1"))
            print (my_list)

This is what I am getting:

['abiotique']
['abiotiques']
[u'abi\xe9tac\xe9e']
[u'abi\xe9tac\xe9e']
[u'abi\xe9tac\xe9es']
[u'abi\xe9tin']
[u'abi\xe9tin']
[u'abi\xe9tins']
[u'abi\xe9tine']
[u'abi\xe9tines']

This is what I am expecting:

['abiotique','abiotiques','abiétacée',...]

Does anyone know how to fix this? Thanks

martineau
Ran
  • Related https://stackoverflow.com/a/47882550/5320906 – snakecharmerb Dec 28 '17 at 17:15
  • One file: Morphalou-2.0.xml – Ran Dec 28 '17 at 17:16
  • It's a shame that you're forced to use Python 2; the Unicode handling in Python 3 is a lot saner. In the meantime, you may find this article helpful: [Pragmatic Unicode](http://nedbatchelder.com/text/unipain.html), which was written by SO veteran Ned Batchelder. – PM 2Ring Dec 28 '17 at 17:49

1 Answer


In Python 3, strings are Unicode by default, so you don't need to use encode; the accented characters print correctly even when the strings are inside a list.
As for the list, you're creating a new one on each iteration. Create it once above the loop, and print it after the loop over the XML elements has finished.
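
To see where the \xe9 escapes come from, here is a small Python 3 sketch (the variable names are made up for illustration). Printing a list shows the repr() of each element, and the repr() of an encoded byte string spells non-ASCII bytes as \x.. escapes. The text itself is not damaged.

```python
# Illustration only: `word` and `encoded` are hypothetical names.
word = "abi\u00e9tac\u00e9e"            # the text 'abiétacée' as a Unicode string

encoded = word.encode("ISO-8859-1")     # bytes, one byte per character here

# A list is printed using repr() of each element.
print([encoded])   # [b'abi\xe9tac\xe9e']  (escapes, because these are bytes)
print([word])      # ['abiétacée']         (readable, because this is text)

# Decoding with the same codec gives the original text back unchanged.
assert encoded.decode("ISO-8859-1") == word
```

In Python 2 the same thing happens with u'...' strings inside a list: the list shows their repr(). Printing the elements one at a time, or joining them into a single string first, shows the accented characters, assuming the console encoding supports them.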

Working example (I've added the word abiétacée to an XML file several times to reproduce your situation):

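import xml.etree.ElementTree as ET

# Raw string so the backslashes in the Windows path are not treated as escapes
file_path = r"C:\Users\HP\Downloads\Morphalou-2.0.xml"

my_list = []  # created once, before the loop
# Listen for "end" events: elem.text is only guaranteed to be complete by then
for event, elem in ET.iterparse(file_path, events=("end",)):
    if elem.tag == 'orthography':
        data = elem.text
        if data is not None:
            for i in data.split('\n'):
                my_list.append(i)
print(my_list)  # printed once, after the loop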

outputs

['abiétacée', 'abiétacée', 'abiétacée', 'abiétacée']
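
For a version that runs without the Morphalou file, here is a minimal Python 3 sketch that parses a small in-memory document instead. The sample XML, the SAMPLE_XML name, and the collect_orthographies helper are all made up for illustration; only the orthography tag name comes from the question.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample data standing in for Morphalou-2.0.xml.
SAMPLE_XML = (
    "<lexicon>"
    "<orthography>abiotique</orthography>"
    "<orthography>abi\u00e9tac\u00e9e</orthography>"
    "<orthography/>"
    "</lexicon>"
).encode("utf-8")


def collect_orthographies(source):
    """Return the text of every <orthography> element in one list."""
    words = []  # created once, so every word lands in the same list
    # Only "end" events are requested, so each element is fully parsed.
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "orthography" and elem.text is not None:
            words.extend(elem.text.split("\n"))
    return words


words = collect_orthographies(io.BytesIO(SAMPLE_XML))
print(words)  # ['abiotique', 'abiétacée']
```

Putting the loop in a function that returns the list keeps the "create once, use after the loop" structure explicit, and the empty element is skipped because its text is None.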

Evya
  • Thanks @Artier, fixed it. As for Python2.7, I'll try going over the docs to find something helpful – Evya Dec 28 '17 at 17:28