Reading special characters from the web in Python

Question

I am scraping an XML webpage for names of people via regex searching. However, if the names contain special characters, Python is not reading them correctly. For example:

Güngüneş A

comes out as:

G\xc3\xbcng\xc3\xbcne\xc5\x9f A

How can I make this format correctly in my output?

It is correct but probably in unicode, how and where are you outputting it? — Paulo Bu, Jun 27 '13 at 15:30
@MarlenaDuda My mistake, seems utf-8 indeed. Still, how and where are you planning to output this? — Paulo Bu, Jun 27 '13 at 15:43

Elazar · Answer 1 · 2013-06-27T15:39:41.837


Use decode() to turn the UTF-8 bytes into a Unicode string:

>>> b'G\xc3\xbcng\xc3\xbcne\xc5\x9f A'.decode()
'Güngüne\u015f A'

(My machine's console has problems displaying 'ş', hence the \u015f escape.)
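To make the decoding step concrete, here is a minimal sketch assuming Python 3 and assuming the page really is UTF-8 encoded (the variable names `raw` and `name` are just for illustration). It shows that the escaped form is only how the string is displayed, not what it contains:

```python
# Raw bytes as they might arrive from the web page (assumed UTF-8).
raw = b'G\xc3\xbcng\xc3\xbcne\xc5\x9f A'

# Decode once, right after reading, naming the encoding explicitly.
name = raw.decode('utf-8')

# ascii() shows escapes for non-ASCII characters; the string itself
# still holds the real characters (10 of them, from 13 bytes).
print(ascii(name))
print(len(raw), len(name))
```

Whether `print(name)` then displays 'Güngüneş A' correctly depends on the encoding of the console or file it is written to.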

edited Jun 27 '13 at 15:39

answered Jun 27 '13 at 15:30


I tried `auth = auth.decode("utf-8")` but this simply turns 'G\xc3\xbcng\xc3\xbcne\xc5\x9f A' into u'G\xfcng\xfcne\u015f A' (changes it a bit and puts a u in front of the string) – Marlena Duda Jun 27 '13 at 15:35

score 0 · Answer 2 · edited May 23 '17 at 10:32


How are you reading these in? What OS are you using? Python 2 or 3? When I run,

myStr = 'G\xc3\xbcng\xc3\xbcne\xc5\x9f A'
print myStr

I get, 'Güngüneş A'.

Further, when I make a test file with the contents, 'Güngüneş A' and run,

mystr = open('test', 'r').read()
print mystr

I get 'Güngüneş A'.
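If the result depends on the platform's default encoding, one option is to name the encoding when opening the file. The following is only a sketch: it assumes the file is UTF-8, uses `io.open` (available in Python 2.6+ and Python 3), and creates a hypothetical temporary `test` file so that it runs on its own:

```python
import io
import os
import tempfile

# Hypothetical test file, created here so the sketch is self-contained.
path = os.path.join(tempfile.mkdtemp(), 'test')
with io.open(path, 'w', encoding='utf-8') as f:
    f.write(u'G\u00fcng\u00fcne\u015f A')

# Reading with an explicit encoding returns decoded text instead of
# relying on the locale or Windows code page.
with io.open(path, 'r', encoding='utf-8') as f:
    text = f.read()

print(repr(text))
```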

I'm using Ubuntu 10.04/Python 2.6 and can't reproduce the problem with the information you've provided. If you post the actual code you're using, it might help. That said, you could try specifying the type of string:

myStr = 'String'
myStr = u'Unicode string'
myStr = r'String literal: escape characters ignored'

Or, if you want to include Unicode characters in your source code, you can add this line to the beginning of your file, as stated in this answer:

# -*- coding: utf-8 -*-
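As an illustration of the declaration in use, here is a short sketch. It assumes the source file itself is saved as UTF-8; in Python 3 the declaration is optional because UTF-8 is already the default source encoding, and the `u` prefix is accepted from Python 3.3 onward:

```python
# -*- coding: utf-8 -*-
# The declaration above must be on the first or second line of the file.

# With the declaration in place, non-ASCII characters can appear in literals.
name = u'Güngüneş A'

# Encoding back to UTF-8 gives the byte sequence shown in the question.
encoded = name.encode('utf-8')
print(repr(encoded))
```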


answered Jun 27 '13 at 15:53

Will


I use Windows and it behaves exactly as the OP describes. Windows and its default `code pages` ... :( Your console is most certainly configured for utf-8; that's why you can read them easily. – Paulo Bu Jun 27 '13 at 15:55
That makes sense - I guess I'll leave the answer up in case the latter part helps, but good to know. – Will Jun 27 '13 at 15:56
