I have a UTF-8 encoded file containing Polish characters, and I need to do some processing on its words. But when I use split(" "), the resulting list contains escapes such as \xc5\x82 or \u0142 instead of the characters themselves.

filename = 'patient.txt'
f = open(filename, 'r')
for line in f:
    print line
    print line.split(" ")
    print unicode(line,encoding(line),errors='ignore').split(" ")
f.close()

Result:

   Pacjent lat 48 został przyjęty do Oddziału z powodu spadku tolerancji wysiłku i duszności.
['\xef\xbb\xbfPacjent', 'lat', '48', 'zosta\xc5\x82', 'przyj\xc4\x99ty', 'do', 'Oddzia\xc5\x82u', 'z', 'powodu', 'spadku', 'tolerancji', 'wysi\xc5\x82ku', 'i', 'duszno\xc5\x9bci.']
[u'Pacjent', u'lat', u'48', u'zosta\u0142', u'przyj\u0119ty', u'do', u'Oddzia\u0142u', u'z', u'powodu', u'spadku', u'tolerancji', u'wysi\u0142ku', u'i', u'duszno\u015bci.']

What do I need to do to get Polish characters in the list? Is it possible at all?

Regards
Pawel

psmith

2 Answers

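You already have Polish characters in the list. When you print a list, Python shows the repr() of each element, and in Python 2 that representation escapes non-ASCII characters. Print an element on its own to see the actual characters: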

>>> print u'zosta\u0142'
został
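
As a rough sketch of the same point (not the answerer's code), the snippet below decodes a byte string and then prints the list, the individual words, and a joined string. The sample bytes are copied from the question's output; the variable names are made up for illustration. It is written to run on both Python 2 and 3; note that on Python 3 the list's repr already shows the characters unescaped.

```python
from __future__ import print_function

# Sample bytes as they would be read from the file (taken from the question's output).
raw = b'zosta\xc5\x82 przyj\xc4\x99ty'

# Decode once, then split the resulting text into words.
words = raw.decode('utf-8').split(u' ')

# Printing the list shows the repr() of each element;
# on Python 2 the non-ASCII characters appear as \u escapes here.
print(words)

# Printing the elements themselves, or a joined string, shows the characters
# (assuming the terminal can display them).
for word in words:
    print(word)

joined = u', '.join(words)
print(joined)
```

The list contents are the same in both cases; only the way they are displayed differs.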
Ignacio Vazquez-Abrams

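Your file is not plain UTF-8 but UTF-8 with a byte order mark (BOM), as the leading \xef\xbb\xbf in your output shows. Decode it with the 'utf-8-sig' codec, which strips the BOM. On Python 2, where the built-in open() has no encoding parameter, use io.open(filename, 'r', encoding='utf-8-sig'); on Python 3 the built-in open() accepts the same argument.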
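
A minimal sketch of reading a BOM-prefixed file, assuming Python's standard 'utf-8-sig' codec (which drops a leading BOM while decoding) and io.open, which behaves the same on Python 2 and 3. The temporary file is a hypothetical stand-in for patient.txt, filled with bytes copied from the question's output.

```python
import io
import os
import tempfile

# Create a small sample file that starts with a UTF-8 BOM
# (hypothetical stand-in for the question's patient.txt).
sample = b'\xef\xbb\xbfPacjent lat 48 zosta\xc5\x82 przyj\xc4\x99ty\n'
fd, path = tempfile.mkstemp(suffix='.txt')
with os.fdopen(fd, 'wb') as f:
    f.write(sample)

# 'utf-8-sig' decodes UTF-8 and removes the BOM if one is present,
# so the first word does not start with a stray U+FEFF character.
with io.open(path, 'r', encoding='utf-8-sig') as f:
    # split() with no argument splits on any whitespace and drops the newline.
    words = [line.split() for line in f]

os.remove(path)
```

Each entry in words is then a list of text strings for one line, with no BOM attached to the first word.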

user2722968