String indices for a UTF-8 word in a string

Question

With the following code, I am getting different index values on my Mac and on Ubuntu. Both are 64-bit machines running Python 2.7.8. The messages.json file contains a string that has some UTF-8 characters at the beginning. The content of the file is:

  #Bangalore fine dinning table bookings in best price ⚡⚡⚡⚡⚡⚡⚡⚡⚡

The python code is as follows:

import re

f = open('messages.json', 'r')
text = f.read().decode('UTF-8')
f.close()

print type(text)

for m in re.finditer('#Bangalore', text): 
    s = m.start()
    e = m.end()
    print s, e
    print text[s:e]

On Ubuntu:

<type 'unicode'>
11 21
#Bangalore

On Mac:

<type 'unicode'>
20 30
#Bangalore

You may also consider [this approach using the `codecs` package](http://stackoverflow.com/a/844443/736937) — jedwards, Mar 13 '15 at 05:57

score 4 · Answer 1 · edited May 23 '17 at 10:27

The problem is that your string contains code points greater than 0xFFFF ("astral" characters). Python (prior to 3.3) comes in two builds: "narrow" and "wide". A narrow build stores unicode strings as 16-bit code units and needs two units (a surrogate pair) for each astral character, so every astral character shifts the following indices by two:

Python 2.7.5 (default, Mar  9 2014, 22:15:05) 
[GCC 4.2.1 Compatible Apple LLVM 5.0 (clang-500.0.68)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> sys.maxunicode
65535
>>> s = u'#Bangalore'
>>> s.index('#')
2

"Wide" builds use 32-bit units and represent every Unicode character with a single unit:

Python 2.7.2+ (default, Jul 20 2012, 22:15:08) 
[GCC 4.6.1] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> sys.maxunicode
1114111
>>> s = u'#Bangalore'
>>> s.index('#')
1
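
To check which kind of build a given interpreter is, you can look at `sys.maxunicode` and at the length of a single astral character. This is only a minimal sketch: the helper name `describe_build` and the sample character U+1F600 are my own choices for illustration and are not taken from the transcripts above.

```python
import sys

def describe_build():
    """Return 'narrow' or 'wide' depending on how this interpreter indexes unicode strings."""
    # Narrow builds report 0xFFFF; wide builds and Python 3.3+ report 0x10FFFF.
    return 'narrow' if sys.maxunicode == 0xFFFF else 'wide'

# One astral character (U+1F600), written as an escape so the source stays ASCII.
astral = u'\U0001F600'

build = describe_build()
units = len(astral)  # 2 on a narrow build, 1 on a wide build or Python 3.3+

print('%s build, len(astral) = %d' % (build, units))
```

If the two machines report different values here, that alone accounts for the shifted indices.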

Possible workarounds are:

- use a modern Python (3.3 or later)
- install a wide Python build on OS X
- rewrite the code so that it doesn't require absolute positions
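
If the indices have to be consumed by something that counts 16-bit units the way a narrow build does, one option is to compute them explicitly instead of relying on the interpreter's build. The following is a sketch under assumptions: `utf16_len` and `utf16_span` are hypothetical helper names, the sample text with two U+1F600 characters is made up, and it is written for Python 3.3+ or a wide build (on a narrow build it should leave the indices unchanged, provided no slice splits a surrogate pair).

```python
import re

def utf16_len(s):
    """Number of UTF-16 code units needed to store s."""
    # utf-16-le emits no byte order mark, so every code unit is exactly 2 bytes.
    return len(s.encode('utf-16-le')) // 2

def utf16_span(text, start, end):
    """Convert the indices (start, end) of a match into UTF-16 code unit indices."""
    start16 = utf16_len(text[:start])
    return start16, start16 + utf16_len(text[start:end])

# Made-up sample: two astral characters, a space, then the hashtag.
text = u'\U0001F600\U0001F600 #Bangalore fine dining'

matches = [(m.start(), m.end()) for m in re.finditer(u'#Bangalore', text)]
for start, end in matches:
    print('%d %d -> %r' % (start, end, utf16_span(text, start, end)))
```

On Python 3.3+ this prints `3 13 -> (5, 15)`: the match starts at code point 3, and each of the two astral characters before it counts as two UTF-16 units.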

Thank you! This is the reason. However, I notice that on Android, where I am using the indices, the expected values match the narrow-build indices. I will need to install a narrow Python. — Pankaj Garg, Mar 13 '15 at 14:14
