#!/usr/bin/env python3
import glob
import xml.etree.ElementTree as ET
filenames = glob.glob("C:\\Users\\####\\Desktop\\BNC2\\[A00-ZZZ]*.xml")
out_lines = []
for filename in filenames:
    with open(filename, 'r', encoding="utf-8") as content:
        tree = ET.parse(content)
        root = tree.getroot()
        for w in root.iter('w'):
            lemma = w.get('hw')
            pos = w.get('pos')
            tag = w.get('c5')

            out_lines.append(w.text + "," + lemma + "," + pos + "," + tag)

with open("C:\\Users\\####\\Desktop\\bnc.txt", "w") as out_file:
    for line in out_lines:
        line = bytes(line, 'utf-8').decode('utf-8', 'ignore')
        out_file.write("{}\n".format(line))

This gives the error:

UnicodeEncodeError: 'charmap' codec can't encode character '\u2192' in position 0: character maps to <undefined>

I thought this line would have solved that:

line = bytes(line, 'utf-8').decode('utf-8', 'ignore')
vvvvv
pglove
  • Have you tried `open("C:\\Users\\####\\Desktop\\bnc.txt", "w", encoding='utf8')` ?? – juanpa.arrivillaga Apr 22 '18 at 05:10
  • Please post the entire traceback. Python told you what line had the problem... pay it forward! Don't make us guess. – tdelaney Apr 22 '18 at 05:36
  • All `line = bytes(line, 'utf-8').decode('utf-8', 'ignore')` did was encode in utf-8 and decode again. You get the original string back. The problem I suspect is when you try to write to the ascii file. – tdelaney Apr 22 '18 at 05:50

2 Answers

2

You need to specify the encoding when opening the output file, same as you did with the input file:

with open("C:\\Users\\####\\Desktop\\bnc.txt", "w", encoding="utf-8") as out_file:
    for line in out_lines:
        out_file.write("{}\n".format(line))
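A minimal sketch of why this works, under some assumptions: the file name `bnc_demo.txt` and the sample string are made up for illustration, and `cp1252` stands in for whatever default "charmap" codec Windows picked for the output file. It shows that the encode/decode round trip leaves the string unchanged, that writing U+2192 through a cp1252 file fails, and that the same write succeeds once the file is opened as UTF-8.

```python
import os
import tempfile

text = "\u2192 arrow"  # U+2192 is the character named in the error message

# Encoding to UTF-8 and decoding again returns the same string,
# so this step cannot remove the offending character.
assert bytes(text, "utf-8").decode("utf-8", "ignore") == text

# Hypothetical output path in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "bnc_demo.txt")

# Assumption: cp1252 mimics the Windows default codec; it has no U+2192.
try:
    with open(path, "w", encoding="cp1252") as out_file:
        out_file.write(text + "\n")
    cp1252_failed = False
except UnicodeEncodeError:
    cp1252_failed = True

# With an explicit UTF-8 encoding the same write succeeds.
with open(path, "w", encoding="utf-8") as out_file:
    out_file.write(text + "\n")

# Read it back with the same encoding to confirm nothing was lost.
with open(path, "r", encoding="utf-8") as in_file:
    round_tripped = in_file.read()
```

The error comes from the file object encoding the text on write, not from the string itself, which is why the fix belongs in the `open()` call.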
Ry-
-2

If your script has multiple reads and writes and you want a particular encoding (say UTF-8) for all of them, you can change the default encoding too:

import sys
reload(sys)
sys.setdefaultencoding('UTF8')

Use this only when there are multiple reads/writes, and do it at the beginning of the script.

Changing default encoding of Python?

Anant Gupta
  • This doesn't work (https://stackoverflow.com/questions/3828723/why-should-we-not-use-sys-setdefaultencodingutf-8-in-a-py-script). It's so bad that `setdefaultencoding` has been removed. You won't find it in 3.6 unless you do some hacky stuff. – tdelaney Apr 22 '18 at 05:46
  • Yes, but when I have had reading to do, I have used it sometimes. I have not tried it with Python 3.6, though. – Anant Gupta Apr 22 '18 at 06:05
  • Changing the default is bad. It breaks other modules that expect the default to be "the default". That's why the function doesn't work unless you do the reload "trick". Just open the file with the correct encoding! – Mark Tolonen Apr 23 '18 at 03:14
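If the goal is to avoid repeating the encoding across many reads and writes, one possible approach in Python 3 is to wrap `open` instead of changing any global default. This is only a sketch: `open_utf8` and the file name `demo.txt` are hypothetical names introduced here, not part of the original script.

```python
import functools
import os
import tempfile

# Hypothetical helper: open() with UTF-8 pre-selected, so every call
# site gets the same encoding without touching interpreter defaults.
open_utf8 = functools.partial(open, encoding="utf-8")

# Hypothetical file in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "demo.txt")

with open_utf8(path, "w") as f:
    f.write("\u2192,lemma,pos,tag\n")

with open_utf8(path) as f:
    line = f.read()
```

Other modules keep seeing the normal default, which avoids the breakage described in the comments above.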