
I have huge CSV files and they contain '\xc3\x84'-style character sequences instead of German umlauts, because I scraped HTML using BeautifulSoup and wrote it to the CSV files using Python 2.7.8.

I managed to replace all those sequences with the help of this question: Python 2.7.1: How to Open, Edit and Close a CSV file

and now my code looks like this:

import csv

new_rows = []
umlaut = {'\\xc3\\x84': 'Ä', '\\xc3\\x96': 'Ö', '\\xc3\\x9c': 'Ü', '\\xc3\\xa4': 'ä', '\\xc3\\xb6': 'ö', '\\xc3\\xbc': 'ü'}

with open('file1.csv', 'r') as csvFile:
    reader = csv.reader(csvFile)
    for row in reader:
        new_row = row
        for key, value in umlaut.items():
            new_row = [ x.replace(key, value) for x in new_row ]
        new_rows.append(new_row)

with open('file2.csv', 'wb') as f:
    writer = csv.writer(f)
    writer.writerows(new_rows)

When I open the CSV I see KÃ¶ln instead of Köln and other "German umlaut" problems. I can solve this manually by opening the CSV file with Notepad and then saving it as UTF-8, but I want to do it automatically with Python.

I do not quite get how to use the UnicodeWriter:

https://docs.python.org/2/library/csv.html#examples

The answers and solutions I found here on Stack Overflow are all a little bit complicated.

My questions are: how would I use, for example, the UnicodeWriter in my case? Do you know any super easy function that does something like file2.encode('utf-8')? If such an easy function doesn't exist in Python, why doesn't it exist yet, given that encoding errors are so common?
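(For illustration, such a re-encoding helper would presumably be little more than the following sketch; the function name is made up here, and the source encoding is an assumption that has to match what the file actually contains:)

import codecs

def reencode_file(src_path, dst_path, src_encoding, dst_encoding='utf-8'):
    # hypothetical helper: read the whole file in its current encoding
    # and write it back out in the target encoding (Python 2)
    with codecs.open(src_path, 'r', encoding=src_encoding) as src:
        text = src.read()
    with codecs.open(dst_path, 'w', encoding=dst_encoding) as dst:
        dst.write(text)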

– dima
  • You realise the encoding of where you are opening the file is the issue? `'\xc3\x84'` is a utf-8 encoded string – Padraic Cunningham Dec 13 '15 at 02:54
  • I think the file already is utf-8 encoded. `'\\xc3\\x84'` is the utf-8 encoding of `'Ä'`, so it doesn't make a lot of sense to replace one with the other. When you _"open the csv I see KÃ¶ln"_, how are you opening it? With Notepad? I think it's decoding using your local code page instead of utf-8. Microsoft includes an encoding hint called the BOM in its files, but Beautiful Soup doesn't. Can you post your encoding (`print sys.stdin.encoding`) so I can try it? And also, does `print codecs.open('file1.csv', encoding='utf-8').read()` print the characters correctly? If so, you are already utf-8. – tdelaney Dec 13 '15 at 03:24
  • After using print sys.stdin.encoding the output was "cp850" in my console – dima Dec 13 '15 at 14:48

2 Answers


Instead of using your own mapping, you can use string-escape encoding:

>>> print '\\xc3\\x84'.decode('string-escape')
Ä

import csv

def iter_decode(it):
    # turn the literal backslash escapes back into real UTF-8 bytes
    for line in it:
        yield line.decode('string-escape')

with open('file1.csv', 'rb') as csvFile, open('file2.csv', 'wb') as f:
    reader = csv.reader(iter_decode(csvFile))
    writer = csv.writer(f)
    for row in reader:
        writer.writerow(row)
– falsetru
  • I tried out your solution suggestion, but it didn't do the trick. I think it is because the default encoding of Excel is ANSI. You probably face the same problem as my code does, because the writing part needs to be done with something like the UnicodeWriter. – dima Dec 13 '15 at 02:50
  • To falsetru: then my mapping was working, too? And is it really just about how and where I open the files? @Padraic Cunningham, I do not get the same output; when I try print '\xc3\x84', my console still prints weird signs. – dima Dec 13 '15 at 03:26
  • @dima, that is because the encoding of your shell is not utf-8; that is a Windows issue, not a Python one – Padraic Cunningham Dec 13 '15 at 03:27
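As tdelaney's comment above points out, Notepad and Excel rely on a byte order mark (BOM) to recognise a file as UTF-8, which is what "save as UTF-8" in Notepad adds. A minimal sketch of prepending one to the converted file (an assumption here is that file2.csv already contains valid UTF-8 bytes) could be:

import codecs

# prepend a UTF-8 BOM so Notepad/Excel detect the encoding
with open('file2.csv', 'rb') as f:
    data = f.read()
with open('file2.csv', 'wb') as f:
    f.write(codecs.BOM_UTF8 + data)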

Given that you have a UnicodeWriter from the docs (https://docs.python.org/2/library/csv.html#examples):

class UnicodeWriter:
    """
    A CSV writer which will write rows to CSV file "f",
    which is encoded in the given encoding.
    """

    def __init__(self, f, dialect=csv.excel, encoding="utf-8", **kwds):
        # Redirect output to a queue
        self.queue = cStringIO.StringIO()
        self.writer = csv.writer(self.queue, dialect=dialect, **kwds)
        self.stream = f
        self.encoder = codecs.getincrementalencoder(encoding)()

    def writerow(self, row):
        self.writer.writerow([s.encode("utf-8") for s in row])
        # Fetch UTF-8 output from the queue ...
        data = self.queue.getvalue()
        data = data.decode("utf-8")
        # ... and reencode it into the target encoding
        data = self.encoder.encode(data)
        # write to the target stream
        self.stream.write(data)
        # empty queue
        self.queue.truncate(0)

    def writerows(self, rows):
        for row in rows:
            self.writerow(row)

use it like so:

from __future__ import unicode_literals
import codecs, csv, cStringIO  # the UnicodeWriter class above needs codecs, csv and cStringIO

# open in binary mode: UnicodeWriter encodes to UTF-8 itself,
# so the target stream must accept raw bytes
f = open("somefile.csv", mode='wb')
writer = UnicodeWriter(f)
for data in some_buffer:
    writer.writerow(data)
f.close()
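To tie this back to the question, one possible way of combining this UnicodeWriter with the string-escape trick from the other answer might be the sketch below; it assumes the cells in file1.csv contain literal \xc3\x84-style escape sequences that decode to UTF-8:

import csv

with open('file1.csv', 'rb') as src, open('file2.csv', 'wb') as dst:
    reader = csv.reader(src)
    writer = UnicodeWriter(dst)  # the class defined above
    for row in reader:
        # turn the escape sequences back into bytes, then into unicode
        writer.writerow([cell.decode('string-escape').decode('utf-8') for cell in row])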
– fiacre
    You should attribute/link the [docs](https://docs.python.org/2/library/csv.html#examples) where you pulled `UnicodeWriter` from. – roippi Dec 13 '15 at 02:35
  • It's not a good idea to recommend `from __future__ import unicode_literals`. It will confuse users when they ask for further help. – Alastair McCormack Dec 13 '15 at 13:08