
I am using Python 2.7.3 and BeautifulSoup to grab data from a website's table, then using codecs to write the content to a file. One of the variables I collect occasionally has garbled characters in it. For example, say the website table looks like this:

 Year    Name   City             State
 2000    John   D’Iberville    MS
 2001    Steve  Arlington        VA

So when I generate my City variable, I always encode it as utf-8:

 Year = foo.text
 Name = foo1.text
 City = foo3.text.encode('utf-8').strip()
 State = foo4.text

 RowsData = ("{0},{1},{2},{3}").format(Year, Name, City, State)

So the comma-separated strings I create, collected in lists called RowsData and RowHeaders, look like this:

 RowHeaders = ['Year,Name,City,State']

 RowsData = ['2000, John, D\xc3\xa2\xe2\x82\xac\xe2\x84\xa2Iberville, MS', 
            '2001, Steve, Arlington, VA']

Then I attempt to write this to a file using the following code

 file1 = codecs.open(Outfile.csv,"wb","utf8")
 file1.write(RowHeaders + u'\n')
 line = "\n".join(RowsData)
 file1.write(line + u'\r\n')
 file1.close()

and I get the following error

 Traceback (most recent call last):  
     File "HSRecruitsFBByPosition.py", line 141, in <module>
       file1.write(line + u'\r\n')

 UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 6879: ordinal not in range(128)

I can use the csv writer package on RowsData and it works fine. For reasons that I don't want to get into, I need to use codecs to output the csv file. I can't figure out what is going on. Can anyone help me fix this issue? Thanks in advance.

Martijn Pieters
Mark Clements

2 Answers


codecs.open() encodes for you. Don't hand it encoded data, because then Python will try and decode the data for you again just so it can encode it to UTF-8. That implicit decoding uses the ASCII codec, but since you have non-ASCII data in your encoded byte string, this fails:

>>> u'D’Iberville'.encode('utf8')
'D\xc3\xa2\xe2\x82\xac\xe2\x84\xa2Iberville'
>>> u'D’Iberville'.encode('utf8').encode('utf8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1: ordinal not in range(128)

The solution is to *not* encode manually:

Year = foo.text
Name = foo1.text
City = foo3.text.strip()
State = foo4.text

Note that codecs.open() is not the most efficient implementation of a file stream. In Python 2.7, I'd use io.open() instead; it offers the same functionality, but implemented more robustly. The io module is the default I/O implementation for Python 3, but also available in Python 2 for forward compatibility.
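To make the point concrete, here is a minimal sketch of the fix (the file name and row values are made up for illustration; this runs unchanged on Python 2.7 and Python 3):

```python
# -*- coding: utf-8 -*-
import io

# Hand the file object *unicode* text; io.open() does the UTF-8
# encoding on write, so there is no manual .encode() anywhere.
rows = [u'2000,John,D\u2019Iberville,MS',
        u'2001,Steve,Arlington,VA']

with io.open('Outfile.csv', 'w', encoding='utf-8') as outf:
    outf.write(u'Year,Name,City,State\n')
    outf.write(u'\n'.join(rows) + u'\n')
```

The same pattern works with `codecs.open('Outfile.csv', 'w', 'utf8')`; the only rule is that everything you `.write()` must already be a unicode string.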

However, you appear to be re-inventing CSV handling; Python has an excellent csv module that can produce CSV files for you. In Python 2 the module cannot handle Unicode, however, so there you do need to encode manually:

import csv

# ...

year = foo.text
name = foo1.text
city = foo3.text.strip()
state = foo4.text

row = [year, name, city, state]

with open(Outfile.csv, "wb") as outf:
    writer = csv.writer(outf)
    writer.writerow(['Year', 'Name', 'City', 'State'])
    writer.writerow([c.encode('utf8') for c in row])

Last but not least, if your HTML page produced the text D’Iberville then you produced a Mojibake; one where you misinterpreted UTF-8 as CP-1252:

>>> u'D’Iberville'.encode('cp1252').decode('utf8')
u'D\u2019Iberville'
>>> print u'D’Iberville'.encode('cp1252').decode('utf8')
D’Iberville

This is usually caused by bypassing BeautifulSoup's encoding detection (pass in byte strings, not Unicode).

You could try and 'fix' these after the fact with:

try:
    City = City.encode('cp1252').decode('utf8')
except UnicodeError:
    # Not a value that could be de-mojibaked, so probably
    # not a Mojibake in the first place.
    pass
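As a sketch of how that repair behaves (the function name `demojibake` is made up here; the logic is the same for Python 2 unicode strings and Python 3 str):

```python
# -*- coding: utf-8 -*-
def demojibake(text):
    """Undo text that was decoded as CP-1252 but was really UTF-8."""
    try:
        return text.encode('cp1252').decode('utf8')
    except UnicodeError:
        # Not a value that could be de-mojibaked, so probably
        # not a mojibake in the first place; leave it alone.
        return text

# Mangled input is repaired:
fixed = demojibake(u'D\u00e2\u20ac\u2122Iberville')  # -> u'D\u2019Iberville'
# Plain ASCII text passes through unchanged:
same = demojibake(u'Arlington')                      # -> u'Arlington'
```

Clean non-ASCII text such as `u'Caf\xe9'` also survives: the `.decode('utf8')` step fails on it, the `except` branch runs, and the original value is returned untouched.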
Martijn Pieters
  • I forgot to add this line in the code, but the reason I encode manually is because if I don't the `.format()` function throws me an error: `UnicodeEncodeError: 'ascii' codec can't encode characters in position 1-3: ordinal not in range(128)`. Is there a way around this? – Mark Clements Aug 20 '14 at 07:33
  • @MarkClements: don't mix byte strings and Unicode values when using `str.format()`. Use a `u'...'` string literal for the format instead. – Martijn Pieters Aug 20 '14 at 07:35
  • @MarkClements: also, why reinvent CSV handling and not use the `csv` module here? – Martijn Pieters Aug 20 '14 at 07:36
  • I'm not sure what you mean by "use a `u'...'` string literal instead." Can you give me an example? – Mark Clements Aug 20 '14 at 07:38
  • @MarkClements: instead of `'some format {}'.format(...)` use `u'some format {}'.format(...)`. – Martijn Pieters Aug 20 '14 at 07:41
  • @Martijn Thanks! That worked great. However, the value that is written to the file shows up as `D’Iberville`. What's the simplest way to get the value for `u'D’Iberville'` to output correctly as `D'Iberville`? I need to have some way of doing this in general, since the garbled text is a mistake that randomly shows up on the website. – Mark Clements Aug 20 '14 at 07:48
  • @MarkClements: you are reading the UTF-8 as CP-1252 again. *Open the file as UTF-8*. You are creating a Mojibake from a Mojibake. – Martijn Pieters Aug 20 '14 at 07:49
  • I am opening the file as a UTF-8 in the line `open(Outfile.csv,"wb","utf8")` aren't I? Sorry, I'm not understanding. – Mark Clements Aug 20 '14 at 07:52
  • @MarkClements: If the mistakes show up randomly, you may have a harder time correcting it again. I've added an option to de-mojibake this, but there could be false positives. These should be rare, however. – Martijn Pieters Aug 20 '14 at 07:52
  • @MarkClements: How are you verifying that the value that is written shows up as `D’Iberville` then? – Martijn Pieters Aug 20 '14 at 07:53
  • Oh, I was just opening the file in excel and that's what the value shows up as. When would the code `City.encode('cp1252').decode('utf8')` give a `UnicodeError`? Won't it leave my `City` variable unchanged unless `City` on the website is garbled in the first place, in which case, it fixes it? I'm just trying to understand why the try/except is needed. Thanks again. – Mark Clements Aug 20 '14 at 07:59
  • @MarkClements: Excel opens files using you system codepage unless told otherwise. See [Is it possible to force Excel recognize UTF-8 CSV files automatically?](http://stackoverflow.com/q/6002256) – Martijn Pieters Aug 20 '14 at 08:01
  • @MarkClements: When `City` is *not* mangled, then either `.encode('cp1252')` fails, or the `.decode('utf8')` fails (both would throw subclasses of `UnicodeError`). Unless it is ASCII text that is, in which case it'll work just fine without changing anything. Either way, mission accomplished, only mangled text is transformed. – Martijn Pieters Aug 20 '14 at 08:03
  • Ok. I'm pretty sure everything on the website is ascii anyway. Thanks so much for your help, I really appreciate it. – Mark Clements Aug 20 '14 at 08:05

This 'D\xc3\xa2\xe2\x82\xac\xe2\x84\xa2Iberville' is a normal (byte) string; its escape sequences are the raw bytes of encoded characters.

So before your codecs file can encode and write it, Python has to decode it to Unicode first. Since you haven't said which codec to use, Python tries ASCII and fails.

>>> s
'D\xc3\xa2\xe2\x82\xac\xe2\x84\xa2Iberville'
>>> type(s)
<type 'str'>
>>> type(s.decode('utf-8'))
<type 'unicode'>
>>> print(s.decode('utf-8'))
D’Iberville

Here's how to understand this process:

  1. First, understand that characters are for humans, bytes are for computers. Computers are just doing us a favor converting bytes to characters so we can understand the data.

  2. So, any time you need to store something for the computer's benefit, you need to convert it from characters to bytes, since bytes are what the computer knows. All files (even text files) are bytes; as soon as you open one, that byte data is converted into characters so that we can understand the contents. For "binary" files (like an image or a Word document), this process is a bit different.

  3. If we are writing "text" content, we need to take the glyphs (the characters) and convert them into bytes so that the file can be written. This process is called encoding.

  4. When we want to "read" a text file, that is, convert the bytes back into glyphs (the characters, or alphabet), we need to decode the bytes - in effect, translate them. To know which glyph corresponds to the stored bytes, we use a lookup table; that table's name (utf-8) is what you pass in.
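The steps above boil down to one round trip (a small sketch; it runs unchanged on Python 2.7 and Python 3):

```python
# -*- coding: utf-8 -*-
text = u'D\u2019Iberville'        # characters, for humans
data = text.encode('utf-8')       # encoding: characters -> bytes (writing)
back = data.decode('utf-8')       # decoding: bytes -> characters (reading)

assert data == b'D\xe2\x80\x99Iberville'  # what actually lands in the file
assert back == text                       # the round trip is lossless
```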

Burhan Khalid