2

Basically I'm trying to convert all lines in the xml file which contain "Account" in lower case and write it back to file.

My XML looks like this:

<?xml version="1.0" encoding="UTF-8"?>

<TRAINERERADMINSTRATIONOBJECTS>
  <TRAINERLIST>
    <TRAINER>
      <Account>Täst</Account>
      <Mark>pUIPBPp8TWw=</Mark>
      <Type>lala</Type>
      <Business>sghs</Business>
    </TRAINER>
  </TRAINERLIST>
</TRAINERADMINSTRATIONOBJECTS>

As you can see the source file is in UTF-8! There are "äüö" in the "Account" line.

My Code:

# -*- coding: utf-8 -*-
with open("buht.xml", "r") as s:
    for line in s:
        if 'Account' in line:
            s = open("buht.xml").read().decode('utf-8')
            s = s.replace(line, line.decode('utf-8').lower())
            f = open("buht.xml", 'w')
            f.write(s)
            f.close()

With this error:

UnicodeEncodeError: 'ascii' codec can't encode character u'\xe4'

I tried to decode (and desperately encode) it anywhere in the script, but it doesn't work.

QuestionA
  • 55
  • 6
  • Have you checked [this question](http://stackoverflow.com/questions/26592988/unicodeencodeerror-ascii-codec-cant-encode-character-u-xe4) ? – jlnabais Sep 25 '15 at 13:42
  • No, but in the original source file there are several other errors like 'u'\xdc' or something ... I will check the question anayway, thanks! – QuestionA Sep 25 '15 at 13:48
  • 1
    Your code has several problems but the immediate issue is that you don't encode the string to UTF8 before writing. – tdelaney Sep 25 '15 at 14:06
  • 2
    You _really_ need to mention which Python version you are using because Unicode handling in Python 2 is quite different to Python 3. – PM 2Ring Sep 25 '15 at 14:09
  • Well, the code has many issues, I'm new in python, so I'm trying to learn ... I'm working with Python2.7 .... I'll try to encode it before writing it back, thanks so far! I also tried it in another way, read file in as xml and lower the specific value. Unfortunately he deleted all the endtags of empty tags, which is also not good ... I really tried a lot before asking here... – QuestionA Sep 25 '15 at 15:12

3 Answers3

1

As tdelaney mentions in the comments there are several problems with your code. You opened the file in read mode, but then in your for loop you attempt to open it again when you detect 'Account' in the current line, and then you try to open it again in write mode; that's not going to work.

There are various ways to do what you want, but here's a solution that works in Python 2.6. I've saved your sample data into a UTF-8 encoded file named "test.xml".

import re

iname = "test.xml"
oname = "test_out.xml"

pat = re.compile('(\s*<Account>)(.*?)(</Account>\s*)', re.U)

with open(iname, "rb") as fin:
    with open(oname, "wb") as fout:
        for line in fin:
            line = line.decode('utf-8')
            m = pat.search(line)
            if m:
                g = m.groups()
                line = g[0] + g[1].lower() + g[2]
            fout.write(line.encode('utf-8'))

contents of test_out.xml

<?xml version="1.0" encoding="UTF-8"?>

<TRAINERADMINSTRATIONOBJECTS>
  <TRAINERERLIST>
    <TRAINER>
      <Account>täst</Account>
      <Mark>pUIPBPp8TWw=</Mark>
      <Type>lala</Type>
      <Business>sghs</Business>
    </TRAINERER>
  </TRAINERLIST>
</TRAINERADMINSTRATIONOBJECTS>

I'm using a Regular Expression to find the target line & to do the replacement. Note that it's generally a very bad idea to use RegEx on XML / HTML, but you can get away with it if you can guarantee that the input data will always be in a simple, known format.

Community
  • 1
  • 1
PM 2Ring
  • 54,345
  • 6
  • 82
  • 182
  • Oh my god ... you made my day. THANK YOU! – QuestionA Sep 25 '15 at 15:26
  • @QuestionA: My pleasure! Since my answer has helped you, please consider [accepting](http://meta.stackexchange.com/questions/5234/how-does-accepting-an-answer-work) it. :) – PM 2Ring Sep 25 '15 at 15:30
1

You may want to use xml.etree.ElementTree to parse this xml file.

However, you may want to try :

with open("buht.xml", "rb") as s:
    for line in s:
        if 'Account' in line:
            line = line.decode('utf-8')

I'm not sure what your processing is about, but you may want to append these to a list, then write to a file later. It looks like you're trying to read and write to the same file at different times.

Also, using if 'Account' in line may return true if you have 'Account' in a different part of your line string as well. If you parse using xml.etree.ElementTree you won't have this problem.

Programmer
  • 293
  • 4
  • 15
  • I also tried it with element tree, it works nice, but there is another problem concerning the endtags, which are deletet if the tag is empty ... thanks for your help, I'll try it like this – QuestionA Sep 25 '15 at 15:18
0

An ElementTree solution:

from xml.etree import ElementTree as et
tree = et.parse('test.xml')
for e in tree.iterfind('.//Account'):
    e.text = e.text.lower()
tree.write('test_out.xml',encoding='UTF-8',xml_declaration=True)

From a comment in another answer, this will convert empty open/close tags such as <tag></tag> to a single, but still valid XML, <tag /> with the same meaning. It parses to the same content so should not worry you.

Mark Tolonen
  • 166,664
  • 26
  • 169
  • 251
  • Oh okay, good to know that the XML is still valid although the syntax is different. Thus my version of code with ElementTree works even though yours is better and more efficient. Thanks! – QuestionA Sep 26 '15 at 20:29
  • Edit: I used your code and it worked well, only problem (that I had with my code too) : In case it is a large xml, I get this error: `ParseError: not well-formed (invalid token): line 2050, column 36` – QuestionA Sep 26 '15 at 20:41
  • You probably have a mismatched tag on that line...an open with no close or vice versa. Your original example has one, for example. – Mark Tolonen Sep 27 '15 at 02:20
  • Checked this but all tags are closed. I guess it's because of some characters like "&" or "§" and so on ... – QuestionA Sep 27 '15 at 07:28