I have a sorted list, J
, of 2048 Japanese Unicode words which ideally I'd like to index using a binary search, as discussed in this SO question ("Binary search (bisection) in Python). The binary_search
function uses bisect_left
from the bisect
module instead of indexing the list with the list.index
function.
import bisect
def binary_search(a, x, lo=0, hi=None): # can't use a to specify default for hi
hi = hi if hi is not None else len(a) # hi defaults to len(a)
pos = bisect.bisect_left(a, x, lo, hi) # find insertion point
return (pos if pos != hi and a[pos] == x else -1)
Now, let's index each word in words
:
words = "そつう れきだい ほんやく わかす りくつ ばいか ろせん やちん そつう れきだい ほんやく わかめ"
Let's split to make a list:
>>> words.split()
[u'\u305d\u3064\u3046', u'\u308c\u304d\u305f\u3099\u3044', u'\u307b\u3093\u3084\u304f', u'\u308f\u304b\u3059', u'\u308a\u304f\u3064', u'\u306f\u3099\u3044\u304b', u'\u308d\u305b\u3093', u'\u3084\u3061\u3093', u'\u305d\u3064\u3046', u'\u308c\u304d\u305f\u3099\u3044', u'\u307b\u3093\u3084\u304f', u'\u308f\u304b\u3081']`)
Comparing binary_search
with list.index
:
>>> [ binary_search(J, x) for x in words.split()]
[-1, 2015, 1790, 2039, 1983, -1, 2031, 1919, -1, 2015, 1790, 2040]
>>> [ J.index(x) for x in words.split()]
[1019, 2015, 1790, 2039, 1983, 1533, 2031, 1919, 1019, 2015, 1790, 2040]
So for u'\u305d\u3064\u3046'
(そつう
), instead of an index of 1019, binary_search
is returning an index of -1 (fail) since pos
evaluates to 1027, u'\u305d\u3068\u304b\u3099\u308f'
. Both words (index 1019 & 1027) begin with u'\u305d'
, so that is where the issue seems to be.
How can binary_search
be tweaked to find the index of a (Japanese) Unicode character?