I am scraping an XML web page for people's names using regex searches. However, if a name contains special characters, Python does not display it correctly. For example:

Güngüneş A

comes out as:

G\xc3\xbcng\xc3\xbcne\xc5\x9f A

How can I make this format correctly in my output?

2 Answers

Use decode() to turn the UTF-8 bytes into a text string:

>>> b'G\xc3\xbcng\xc3\xbcne\xc5\x9f A'.decode()
'Güngüne\u015f A'

(My console cannot display 'ş', so it is shown as the escape \u015f.)
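Applied to the scraping case in the question, the idea is to decode the downloaded bytes once and run the regex on the resulting text. This is only a minimal sketch in Python 3: the `raw` bytes and the `<Author>` tag are hypothetical stand-ins, since the question does not show the actual page or pattern, and UTF-8 is assumed because the bytes in the question decode cleanly as UTF-8.

```python
import re

# Hypothetical response body (bytes), standing in for the scraped XML page.
raw = b'<Author>G\xc3\xbcng\xc3\xbcne\xc5\x9f A</Author>'

# Decode once, as early as possible, so everything downstream works on text.
text = raw.decode('utf-8')

# Hypothetical pattern; the real tag name is not given in the question.
names = re.findall(r'<Author>(.*?)</Author>', text)

# names is now a list of text strings with the special characters intact.
print(len(names[0]))  # 10 characters
```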

Elazar
  • I tried `auth = auth.decode("utf-8")`, but this simply turns 'G\xc3\xbcng\xc3\xbcne\xc5\x9f A' into u'G\xfcng\xfcne\u015f A' (it changes the escapes a bit and puts a u in front of the string). – Marlena Duda Jun 27 '13 at 15:35
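The u'...' form quoted in the comment is most likely just the escaped representation of a correctly decoded string, not a sign that decoding failed. A rough Python 3 sketch of the difference follows; `ascii()` is used here to reproduce the escaped form that Python 2 shows for a unicode object, and the variable names are illustrative only.

```python
raw = b'G\xc3\xbcng\xc3\xbcne\xc5\x9f A'
name = raw.decode('utf-8')

# The escaped representation: the same string, written with ASCII-only escapes.
escaped = ascii(name)
print(escaped)  # 'G\xfcng\xfcne\u015f A'

# The decoded string holds one item per character, the raw bytes one per byte.
print(len(raw), len(name))  # 13 10
```

Printing the string itself, rather than its representation or a list containing it, is what displays the actual characters, provided the console can show them.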
How are you reading these in? What OS are you using? Python 2 or 3? When I run,

myStr = 'G\xc3\xbcng\xc3\xbcne\xc5\x9f A'
print myStr

I get 'Güngüneş A'.

Further, when I make a test file containing 'Güngüneş A' and run,

mystr = open('test', 'r').read()
print mystr

I get 'Güngüneş A'.
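Reading the file this way relies on the console happening to use the same encoding as the file. A sketch that names the encoding explicitly is shown here, assuming Python 3 and a UTF-8 file; the temporary path is a hypothetical stand-in for the 'test' file above.

```python
import io
import os
import tempfile

# Hypothetical test file, written to a temporary directory for this sketch.
path = os.path.join(tempfile.mkdtemp(), 'test')

with io.open(path, 'w', encoding='utf-8') as f:
    f.write(u'Güngüneş A')

# Name the encoding explicitly instead of relying on the platform default.
with io.open(path, 'r', encoding='utf-8') as f:
    mystr = f.read()

print(len(mystr))  # 10 characters, not 13 bytes
```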

I'm using Ubuntu 10.04 with Python 2.6 and can't reproduce the problem from the information you've provided. If you post the actual code you're using, it might help. That said, you could try specifying the type of string:

myStr = 'String'
myStr = u'Unicode string'
myStr = r'Raw string: escape sequences are not processed'

Or, if you want to include Unicode characters in your source code, you can add this line to the beginning of your file, as stated in this answer:

# -*- coding: utf-8 -*-
Will
  • I use Windows and it behaves exactly as the OP describes. Windows and its default `code pages` ... :( Your console is almost certainly configured for UTF-8, which is why you can read them easily. – Paulo Bu Jun 27 '13 at 15:55
  • That makes sense - I guess I'll leave the answer up in case the latter part helps, but good to know. – Will Jun 27 '13 at 15:56
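To illustrate the code page point from the comments, here is a small Python 3 sketch of what happens when the decoded name is encoded for a console that lacks some of its characters. The choice of cp1252 and cp1254 is an assumption for illustration; the thread does not say which code page any of the machines actually used.

```python
name = u'G\xfcng\xfcne\u015f A'  # the decoded name, written with escapes

# cp1252 (Western European) has 'ü' but not 'ş'; errors='replace' substitutes '?'.
as_cp1252 = name.encode('cp1252', errors='replace')
print(as_cp1252)  # b'G\xfcng\xfcne? A'

# cp1254 (Turkish) contains both characters, so the name survives a round trip.
as_cp1254 = name.encode('cp1254')
round_trip = as_cp1254.decode('cp1254')
```

A console set to UTF-8 can represent every character, which would be consistent with the Ubuntu result reported in the answer.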