Binary search (bisect) for a Unicode character in Python 2.7

Question

I have a sorted list, J, of 2048 Japanese Unicode words which ideally I'd like to index using a binary search, as discussed in this SO question ("Binary search (bisection) in Python). The binary_search function uses bisect_left from the bisect module instead of indexing the list with the list.index function.

import bisect
def binary_search(a, x, lo=0, hi=None): # can't use a to specify default for hi
    hi = hi if hi is not None else len(a)   # hi defaults to len(a)
    pos = bisect.bisect_left(a, x, lo, hi)  # find insertion point
    return (pos if pos != hi and a[pos] == x else -1)

Now, let's index each word in words:

words = "そつう　れきだい　ほんやく　わかす　りくつ　ばいか　ろせん　やちん　そつう　れきだい　ほんやく　わかめ"

Let's split to make a list:

>>> words.split()
[u'\u305d\u3064\u3046', u'\u308c\u304d\u305f\u3099\u3044', u'\u307b\u3093\u3084\u304f', u'\u308f\u304b\u3059', u'\u308a\u304f\u3064', u'\u306f\u3099\u3044\u304b', u'\u308d\u305b\u3093', u'\u3084\u3061\u3093', u'\u305d\u3064\u3046', u'\u308c\u304d\u305f\u3099\u3044', u'\u307b\u3093\u3084\u304f', u'\u308f\u304b\u3081']`)

Comparing binary_search with list.index:

>>> [ binary_search(J, x) for x in words.split()]
[-1, 2015, 1790, 2039, 1983, -1, 2031, 1919, -1, 2015, 1790, 2040]
>>> [ J.index(x) for x in words.split()]
[1019, 2015, 1790, 2039, 1983, 1533, 2031, 1919, 1019, 2015, 1790, 2040]

So for u'\u305d\u3064\u3046' (そつう), instead of an index of 1019, binary_search is returning an index of -1 (fail) since pos evaluates to 1027, u'\u305d\u3068\u304b\u3099\u308f'. Both words (index 1019 & 1027) begin with u'\u305d', so that is where the issue seems to be.

How can binary_search be tweaked to find the index of a (Japanese) Unicode character?

Binary search requires the input list to be sorted. Is your input sorted? — Martijn Pieters, Jul 01 '15 at 08:39
What is `J`? It is not your `words.split()` list. You cannot use the `words` value here as that is not a list but a string. — Martijn Pieters, Jul 01 '15 at 08:42
@MartijnPieters the list, J, *is* sorted. The code takes a Unicode mnemonic string; I've simplified it because of ideographic spaces used in Japanese. Binary_search is indexing words from a list, hence the split — Aussie Cryptocurrency, Jul 01 '15 at 08:42
The search works just fine if I use `J = sorted(words.split())`. — Martijn Pieters, Jul 01 '15 at 08:43
Ah, `J` is supposed to be the file loaded from github, split by lines? Did you strip whitespace? How did you load and sort that list? — Martijn Pieters, Jul 01 '15 at 08:44
When I load `J` from github with `J = sorted(requests.get('https://raw.githubusercontent.com/trezor/python-mnemonic/master/mnemonic/wordlist/japanese.txt').text.splitlines())` I get different output for the binary search and for the `J.index()` calls. The list is `[1024, 2015, 1787, 2039, 1983, 1601, 2031, 1919, 1024, 2015, 1787, 2040]` in both cases. — Martijn Pieters, Jul 01 '15 at 08:45
In other words, the binary search works *fine*, but your `J` list is not properly sorted or cleaned. I note that the discrepancies between my list and yours are also the points where the binary search fails, except for the two indices that are close to my results. — Martijn Pieters, Jul 01 '15 at 08:46
@MartijnPieters [test.py](https://github.com/simcity4242/pybitcointools/blob/master/test.py#L563-L570). I see the issue now, if you want to put this into an answer I'll check it as solved — Aussie Cryptocurrency, Jul 01 '15 at 08:48

score 1 · Accepted Answer · answered Jul 01 '15 at 08:52

It appears your J list is at fault here; it is not correctly sorted.

When I use the requests library to load your test data file and sort the information, it works fine:

 >>> # your bisect function defined
 ...
 >>> import requests
 >>> words = u"そつう　れきだい　ほんやく　わかす　りくつ　ばいか　ろせん　やちん　そつう　れきだい　ほんやく　わかめ"
 >>> url = 'https://raw.githubusercontent.com/trezor/python-mnemonic/master/mnemonic/wordlist/japanese.txt'
 >>> J = sorted(requests.get(url).text.splitlines())
 >>> [ J.index(x) for x in words.split()]
[1024, 2015, 1787, 2039, 1983, 1601, 2031, 1919, 1024, 2015, 1787, 2040]
>>> [ binary_search(J, x) for x in words.split()]
[1024, 2015, 1787, 2039, 1983, 1601, 2031, 1919, 1024, 2015, 1787, 2040]

Note that the indices don't quite match your test results; where the indices are off by more than 3 the bisect fails to find the result.

Binary search (bisect) for a Unicode character in Python 2.7

1 Answers1