
I was reading this: python: open and read a file containing germanic umlaut as unicode

I'm reading my dataframe from a CSV file, using pd.read_csv()

The \x9f should be an umlaut:

'Heiner Dr\x9fke "Weil, Gotshal & Manges"'

I tried to no avail:

person1.encode('utf-8')

UnicodeDecodeError: 'ascii' codec can't decode byte 0x9f in position 9: ordinal not in range(128)

What I tried:

Decoding with macroman, person1.decode('macroman'), gives me this:
Out[511]:
u'Heiner Dr\xfcke "Weil, Gotshal & Manges"'

However, print person1.decode('macroman') does print out the umlaut. How do I capture this into a string?

person1.decode("cp1251")
Out[512]:
u'Heiner Dr\u045fke "Weil, Gotshal & Manges"'
user3314418

2 Answers


Your data has somehow ended up encoded as MacRoman, which it shouldn't be:

>>> print 'Heiner Dr\x9fke "Weil, Gotshal & Manges"'.decode("macroman")
Heiner Drüke "Weil, Gotshal & Manges"

This decodes the byte string into a unicode object that Python understands.
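As a minimal sketch of why the escaped `u'Dr\xfcke'` display is not a problem, the snippet below decodes the same bytes and shows that the result already holds the real character. It is written in Python 3 syntax (an assumption; the thread itself uses Python 2, where `ascii()` here plays the role of the interactive `Out[...]` display). The second half shows decoding at read time instead of afterwards; `io.BytesIO` is only a stand-in for a file opened in binary mode. pandas `read_csv` also takes an `encoding` argument, so passing the same codec name there may be the simpler fix.

```python
import csv
import io

raw = b'Heiner Dr\x9fke "Weil, Gotshal & Manges"'

# Decoding turns the Mac Roman bytes into a text string; byte 0x9f maps to U+00FC.
person = raw.decode("mac_roman")

# ascii() gives the escaped display form; the string itself is unchanged.
escaped = ascii(person)
print(escaped)

# The tenth character really is u-umlaut, so the string can be used as-is.
has_umlaut = person[9] == "\u00fc"
print(has_umlaut)

# Decoding at read time: wrap the byte stream with the same codec.
# The BytesIO object is a stand-in for open(path, "rb").
byte_stream = io.BytesIO(b"name\r\nHeiner Dr\x9fke\r\n")
text_stream = io.TextIOWrapper(byte_stream, encoding="mac_roman", newline="")
rows = list(csv.reader(text_stream))
```

The escaped form and the printed form are two views of the same string, so there is nothing extra to "capture".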

If you want to encode it for a Google search:

'Heiner Dr\x9fke "Weil, Gotshal & Manges"'.decode("macroman").encode('ascii', 'xmlcharrefreplace')

This should work fine.
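For comparison, here is a rough sketch of the `xmlcharrefreplace` approach next to UTF-8 percent-encoding, which is the usual form for a URL query string. It uses Python 3 syntax (an assumption; the answer above is Python 2), and `quote_plus` is an alternative I am suggesting here, not something from the answer. Whether a given search endpoint prefers one form over the other is not covered by this thread.

```python
from urllib.parse import quote_plus

raw = b'Heiner Dr\x9fke "Weil, Gotshal & Manges"'
person = raw.decode("mac_roman")

# The answer's approach: non-ASCII characters become XML character references.
xml_escaped = person.encode("ascii", "xmlcharrefreplace")
print(xml_escaped)

# Alternative for a URL query string: UTF-8 bytes, percent-encoded.
query = quote_plus(person)
print(query)
```

The first form is meant for HTML or XML text, while the second is what a browser would put in the address bar.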

Joran Beasley
  • I get this when i use macroman person1.decode('macroman') Out[511]: u'Heiner Dr\xfcke "Weil, Gotshal & Manges"' – user3314418 Jun 27 '14 at 00:56
  • You have to `print person1.decode('macroman')` to see the character in Python 2.X. `u'\xfc'` is the Unicode hex escape character for `ü`. – Mark Tolonen Jun 27 '14 at 01:36
  • Is there a way to capture the 'ü' for search? I'm trying to use that string to search on Google, but I can only use 'Dr\xfcke' – user3314418 Jun 27 '14 at 01:45
  • See the updated answer, which should give you a path forward. You did not specify that criterion in your initial question, but this method should work fine – Joran Beasley Jun 27 '14 at 02:50
  • Because I googled umlaut \x9f and found it to be that encoding. It encodes it to ASCII, using XML character replacement on any non-ASCII characters. In other words, it just works ;P – Joran Beasley Jun 27 '14 at 03:56

According to this reference, u umlaut is written as \u00fc in a unicode literal: u = u"profileDir_(\u00fc)"

Rachel Gallen
  • I don't think this really answers his question. His string is encoded as MacRoman; when properly decoded, he does get his desired unicode result – Joran Beasley Jun 27 '14 at 02:52