With the following code, I get different index values on my Mac and on Ubuntu. Both are 64-bit machines running Python 2.7.8. The messages.json file contains a string with some UTF-8 characters at the beginning. The content of the file is:

  #Bangalore fine dinning table bookings in best price ⚡⚡⚡⚡⚡⚡⚡⚡⚡

The python code is as follows:

import re

f = open('messages.json', 'r')
text = f.read().decode('UTF-8')
f.close()

print type(text)

for m in re.finditer('#Bangalore', text): 
    s = m.start()
    e = m.end()
    print s, e
    print text[s:e]

On Ubuntu:

<type 'unicode'>
11 21
#Bangalore

On Mac:

<type 'unicode'>
20 30
#Bangalore
Pankaj Garg

1 Answer

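The problem is that your string contains code points greater than 0xFFFF ("astral" characters). Python 2 (and Python 3 prior to 3.3) comes in two build variants: "narrow" and "wide". A narrow build stores unicode strings as 16-bit code units and needs two units (a surrogate pair) for each astral character, so every astral character before a match shifts its index by one extra position. In the sessions shown here, the string literal begins with a single astral character before the '#', which may not render: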

Python 2.7.5 (default, Mar  9 2014, 22:15:05) 
[GCC 4.2.1 Compatible Apple LLVM 5.0 (clang-500.0.68)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> sys.maxunicode
65535
>>> s = u'#Bangalore'
>>> s.index('#')
2

"wide" builds use 32 bits and represent all unicode chars with one unit:

Python 2.7.2+ (default, Jul 20 2012, 22:15:08) 
[GCC 4.6.1] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> sys.maxunicode
1114111
>>> s = u'#Bangalore'
>>> s.index('#')
1
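
To see where the extra unit on a narrow build comes from, here is a minimal sketch of the standard UTF-16 surrogate-pair arithmetic. The helper name `to_surrogate_pair` is made up for illustration, and U+1F600 is only an example astral code point, not necessarily one from the question's file.

```python
def to_surrogate_pair(codepoint):
    """Split an astral code point into its UTF-16 high/low surrogates."""
    if not 0x10000 <= codepoint <= 0x10FFFF:
        raise ValueError("not an astral code point")
    offset = codepoint - 0x10000          # 20 significant bits remain
    high = 0xD800 + (offset >> 10)        # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)       # bottom 10 bits
    return high, low

high, low = to_surrogate_pair(0x1F600)    # (0xD83D, 0xDE00)
```

A narrow build stores both surrogates as separate elements of the unicode string, which is why len() and index() count such a character twice.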

Possible workarounds are
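
One option, sketched here as an assumption and not necessarily what the answer originally listed, is to stop relying on the build's native indices and convert them to UTF-16 code-unit offsets explicitly. The helpers `utf16_len` and `utf16_span` are hypothetical names, and the sample text uses U+1F600 as a stand-in astral character.

```python
import re

def utf16_len(s):
    # Each UTF-16 code unit is 2 bytes; an astral character encodes to 4.
    return len(s.encode('utf-16-le')) // 2

def utf16_span(text, start, end):
    """Convert native string indices into UTF-16 code-unit offsets."""
    first = utf16_len(text[:start])
    return first, first + utf16_len(text[start:end])

text = u'\U0001F600\U0001F600#Bangalore'
m = re.search(u'#Bangalore', text)
span = utf16_span(text, m.start(), m.end())   # (4, 14) on narrow and wide builds
```

The resulting offsets match what a narrow build reports, so they should line up with consumers that index strings by UTF-16 code units.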

georg
  • Thank you! This is the reason. However, I notice that on Android, where I am using the indices, they are compatible with narrow-build indices. I will need to install a narrow Python build. – Pankaj Garg Mar 13 '15 at 14:14