5

I know there are tons of threads regarding this issue but I have not managed to find one which solves my problem.

I am trying to print a string but when printed it doesn't show special characters (e.g. æ, ø, å, ö and ü). When I print the string using repr() this is what I get:

u'Von D\xc3\xbc' and u'\xc3\x96berg'

Does anyone know how I can convert this to Von Dü and Öberg? It's important to me that these characters are not simply dropped, as they would be with myStr.encode("ascii", "ignore").

EDIT

This is the code I use. I use BeautifulSoup to scrape a website. The contents of a cell (<td>) in a table (<table>) are put into the variable name. This is the variable that contains the special characters I cannot print.

web = urllib2.urlopen(url)
soup = BeautifulSoup(web)
tables = soup.find_all("table")
scene_tables = [2, 3, 6, 7, 10]
scene_index = 0
# Iterate over the <table>s we want to work with
for scene_table in scene_tables:
    i = 0
    # Iterate over the <td>s to find time and name
    for td in tables[scene_table].find_all("td"):
        if i % 2 == 0:  # td contains the time
            time = remove_whitespace(td.get_text())
        else:           # td contains the name
            name = remove_whitespace(td.get_text()) # This is the variable containing "nonsense"
            print "%s: %s" % (time, name,)
        i += 1
    scene_index += 1
simonbs

3 Answers

12

Prevention is better than cure. What you need is to find out how that rubbish is being created. Please edit your question to show the code that creates it, and then we can help you fix it. It looks like somebody has done:

your_unicode_string =  original_utf8_encoded_bytestring.decode('latin1')

The cure is simply to reverse the process: encode back to latin1 to recover the original bytes, then decode those bytes as UTF-8.

correct_unicode_string = your_unicode_string.encode('latin1').decode('utf8')
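A minimal sketch of that round trip, applied to the two strings from the question. The helper name `undo_latin1_misdecode` is made up for illustration, and the `u''` literals are only there so the snippet runs on both Python 2.7 and Python 3:

```python
def undo_latin1_misdecode(text):
    """Reverse a wrong latin1 decode of UTF-8 bytes (hypothetical helper)."""
    # latin1 maps code points 0-255 one-to-one onto bytes, so encoding
    # recovers the original byte string; decoding as UTF-8 then gives
    # the text that was intended in the first place.
    return text.encode('latin1').decode('utf8')

fixed_a = undo_latin1_misdecode(u'Von D\xc3\xbc')
fixed_b = undo_latin1_misdecode(u'\xc3\x96berg')

print(repr(fixed_a))
print(repr(fixed_b))
```

This only undoes the damage after the fact; it does not explain where the wrong decode happened.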

Update: Based on the code that you supplied, the probable cause is that the website declares that it is encoded in ISO-8859-1 (aka latin1) but in reality it is encoded in UTF-8. Please update your question to show us the url.

If you can't show it, read the BS docs; it looks like you'll need to use:

BeautifulSoup(web, from_encoding='utf8')
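To illustrate the suspected cause without BeautifulSoup or a network call, here is a rough sketch using only built-in string methods. The byte string and the two charset names are assumptions standing in for the real page, which has not been shown:

```python
# Bytes as a server might send them: O-umlaut + "berg", encoded as UTF-8.
raw_bytes = u'\xd6berg'.encode('utf-8')

# Assumed values: what the page claims versus what it really uses.
declared_charset = 'iso-8859-1'
actual_charset = 'utf-8'

# Trusting the declaration reproduces the garbage from the question.
garbled = raw_bytes.decode(declared_charset)
# Decoding with the real encoding gives the intended text.
correct = raw_bytes.decode(actual_charset)

print(repr(garbled))
print(repr(correct))
```

Passing `from_encoding` should have roughly the effect of the second decode: the parser is told which encoding to use instead of believing the declaration.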
John Machin
  • I have updated my question to show the code which I use. I use BeautifulSoup to scrape a website. Then the contents of a cell in a table, is thrown into the variable `name`. This is the variable which contains special characters that I cannot print. – simonbs Apr 02 '12 at 10:20
  • Using `name.encode('latin1').decode('utf8')` solves all my issues. The characters look perfect, but you say this is not the right way to do it? – simonbs Apr 02 '12 at 10:24
  • @SimonBS: Re-read the first sentence of my answer. It's always better to *understand* your real problem and fix it at the source, not downstream. That encode/decode is merely reversing out the underlying problem. – John Machin Apr 02 '12 at 10:38
  • `BeautifulSoup(web, from_encoding='utf8')` did the trick. Thank you very much! – simonbs Apr 02 '12 at 10:55
3

Unicode support in many languages is confusing, so your error here is understandable. Those strings are UTF-8 bytes, which would work properly if you drop the u at the front:

>>> err = u'\xc3\x96berg'
>>> print err
Ã?berg
>>> x = '\xc3\x96berg'
>>> print x
Öberg
>>> u = x.decode('utf-8')
>>> u
u'\xd6berg'
>>> print u
Öberg

For lots more information:

http://www.joelonsoftware.com/articles/Unicode.html

http://docs.python.org/howto/unicode.html


You should really really read those links and understand what is going on before proceeding. If, however, you absolutely need to have something that works today, you can use this horrible hack that I am embarrassed to post publicly:

def convert_fake_unicode_to_real_unicode(string):
    return ''.join(map(chr, map(ord, string))).decode('utf-8')
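For what it's worth, the hack boils down to a latin1 encode followed by a UTF-8 decode. A small sketch of that equivalence, with both function names invented for the comparison and `bytearray` used so that it runs on Python 2.7 and Python 3:

```python
def via_ord_chr(text):
    # Same idea as the hack: each code point (0-255) becomes one byte.
    return bytes(bytearray(ord(ch) for ch in text)).decode('utf-8')

def via_latin1(text):
    # latin1 performs exactly that code-point-to-byte mapping.
    return text.encode('latin1').decode('utf-8')

sample = u'\xc3\x96berg'
same_result = via_ord_chr(sample) == via_latin1(sample)
print(same_result)
```

Both versions only accept code points below 256 and raise an exception for anything else.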
A B
  • When I print the strings without `repr()`, this is what I get: `Ãberg` but what I would like to have is `Öberg`. If I use `decode('utf-8')`, I will get a `UnicodeEncodeError`. If the strings are UTF-8, shouldn't it write a `Ö` instead of `Ã`? – simonbs Apr 02 '12 at 09:29
  • 1
    You'll want to figure out how those variables got to be of type `unicode` in the first place. They're actually UTF-8 encoded in ascii, so they should properly be of type `str`. – A B Apr 02 '12 at 09:31
  • -1 for (1) the join/map/chr/map/ord mess (2) "UTF-8 encoded in ascii" – John Machin Apr 02 '12 at 10:19
1

The contents of the strings are not unicode, they are UTF-8 encoded.

>>> print u'Von D\xc3\xbc'
Von DÃ¼
>>> print 'Von D\xc3\xbc'
Von Dü

>>> print unicode('Von D\xc3\xbc', 'utf-8')
Von Dü
>>> 

Edit:

>>> print '\xc3\x96berg' # no unicode identifier, works as expected because it's a UTF-8 encoded string
Öberg
>>> print u'\xc3\x96berg' # has unicode identifier, means print uses the unicode charset now, outputs weird stuff
Ãberg

# Look at the differing object types:
>>> type('\xc3\x96berg')
<type 'str'>
>>> type(u'\xc3\x96berg')
<type 'unicode'>

>>> '\xc3\x96berg'.decode('utf-8') # this command converts from UTF-8 to unicode, look at the unicode identifier in the output
u'\xd6berg'
>>> unicode('\xc3\x96berg', 'utf-8') # this does the same thing
u'\xd6berg'
>>> unicode(u'foo bar', 'utf-8') # trying to convert a unicode string to unicode will fail as expected
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: decoding Unicode is not supported
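One way to see the difference between encoded bytes and decoded text without depending on how the terminal renders the output is to compare lengths. This is only a sketch; the variable names are arbitrary, and the `b''` literal is a plain `str` on Python 2 and `bytes` on Python 3:

```python
raw = b'\xc3\x96berg'        # UTF-8 encoded data: 6 bytes
text = raw.decode('utf-8')   # decoded text: 5 characters

print(type(raw).__name__)
print(type(text).__name__)
print(len(raw))
print(len(text))
```

The two bytes `\xc3\x96` collapse into the single character `\xd6` once decoded, which is why the text is one unit shorter.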
Fabian
  • When I print the strings without `repr()`, this is what I get: `Ãberg` but what I would like to have is `Öberg`. If the strings are UTF-8, shouldn't it write a `Ö` instead of `Ã`? If I use `unicode`, I get the following error: `TypeError: decoding Unicode is not supported`. – simonbs Apr 02 '12 at 09:31
  • You still use the unicode identifier (`u'foo'`). It's a UTF-8 encoded string, and by using the unicode identifier you say it's unicode when it's not. That's why you get `Ã` instead of `Ö`. Drop the identifier and you'll be fine. I'll update my answer to make it clear. – Fabian Apr 02 '12 at 09:35
  • @SimonBS I updated my answer. You should still read this link: http://docs.python.org/howto/unicode.html – Fabian Apr 02 '12 at 09:42
  • I just read the link. I am still a bit confused, though. I have my string, `myStr`, which is of type `unicode`, meaning it has the unicode identifier. I want to remove this identifier and have a UTF-8 encoded string. How would I do this? I had thought it would be as simple as `myStr.encode("utf-8")`, which returns an object of type `str`, but this throws a `UnicodeDecodeError`. – simonbs Apr 02 '12 at 10:11
  • @SimonBS That should work. Can you post that example in your question or at http://pastebin.com ? – Fabian Apr 02 '12 at 10:16
  • -1 "Those strings are not unicode" -- repr(those_strings) has a `u` out the front; they ARE unicode, they're botched unicode. He has DATA, not source-code literals. The `u` is put there by repr(); he can't "drop the identifier". – John Machin Apr 02 '12 at 10:17
  • @JohnMachin True. The string itself is unicode, but the content in it isn't. By "drop the identifier" I meant that he shouldn't treat the string as unicode, because the text in it is UTF-8. I updated my answer. – Fabian Apr 02 '12 at 10:22