I work on an application that handles text in several languages, so for display and reporting purposes some strings need to be sorted according to the rules of a specific language.

Currently I have a workaround that changes the global locale settings, which is bad practice, and I don't want to put it into production:

import locale
from functools import cmp_to_key

default_locale = locale.getlocale(locale.LC_COLLATE)

def sort_strings(strings, locale_=None):
    if locale_ is None:
        return sorted(strings)

    locale.setlocale(locale.LC_COLLATE, locale_)
    # cmp_to_key is needed on Python 3, where sorted() no longer accepts cmp=
    sorted_strings = sorted(strings, key=cmp_to_key(locale.strcoll))
    locale.setlocale(locale.LC_COLLATE, default_locale)

    return sorted_strings

The official Python locale documentation explicitly says that saving and restoring the locale is a bad idea, but it does not suggest an alternative: http://docs.python.org/library/locale.html#background-details-hints-tips-and-caveats

Eryk Sun
vonPetrushev

3 Answers

You could use PyICU's collator to avoid changing global settings:

import icu # PyICU

def sorted_strings(strings, locale=None):
    if locale is None:
        return sorted(strings)
    collator = icu.Collator.createInstance(icu.Locale(locale))
    return sorted(strings, key=collator.getSortKey)

Example:

>>> L = [u'sandwiches', u'angel delight', u'custard', u'éclairs', u'glühwein']
>>> sorted_strings(L)
['angel delight', 'custard', 'glühwein', 'sandwiches', 'éclairs']
>>> sorted_strings(L, 'en_US')
['angel delight', 'custard', 'éclairs', 'glühwein', 'sandwiches']

Disadvantages: a dependency on the PyICU library, and behavior that differs slightly from locale.strcoll.


I don't know how to get a locale.strxfrm function for a given locale name without changing the locale globally. As a hack, you could run your function in a separate child process:

import multiprocessing

pool = multiprocessing.Pool()
# ...
pool.apply(locale_aware_sort, [strings, loc])

Disadvantage: it might be slow and resource-hungry.
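
A slightly fuller sketch of the child-process idea, in case it helps. It assumes the worker is the locale_aware_sort function named in the call above; its body and the sort_in_child helper are hypothetical names and details of my own, not tested in production:

```python
import locale
import multiprocessing


def locale_aware_sort(strings, loc):
    # Runs in the worker process, so the locale change never
    # touches the parent process.
    locale.setlocale(locale.LC_COLLATE, loc)
    return sorted(strings, key=locale.strxfrm)


def sort_in_child(strings, loc):
    # Hypothetical helper: one short-lived worker per call.
    # A long-lived pool would avoid paying the startup cost every time.
    pool = multiprocessing.Pool(processes=1)
    try:
        return pool.apply(locale_aware_sort, [strings, loc])
    finally:
        pool.close()
        pool.join()
```

sort_in_child should be called from code guarded by `if __name__ == "__main__":` so that the worker can import the module cleanly on platforms that spawn rather than fork.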


An ordinary threading.Lock won't work unless you control every place where locale-aware functions could be called from multiple threads (and such functions are not limited to the locale module; re, for example, is affected too).
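
For illustration only, this is roughly what the lock-based variant would look like. The names _locale_lock, locked_collation and sort_strings_locked are made up for this sketch, and it protects only callers that go through these functions; any other thread calling a locale-aware function at the same time still sees the temporary locale:

```python
import locale
import threading
from contextlib import contextmanager

_locale_lock = threading.Lock()  # hypothetical module-level lock


@contextmanager
def locked_collation(name):
    # Serializes locale switches made through this helper only.
    with _locale_lock:
        saved = locale.setlocale(locale.LC_COLLATE)  # query, no change
        locale.setlocale(locale.LC_COLLATE, name)
        try:
            yield
        finally:
            locale.setlocale(locale.LC_COLLATE, saved)


def sort_strings_locked(strings, name):
    with locked_collation(name):
        return sorted(strings, key=locale.strxfrm)
```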

jfs

The ctypes solution is fine, but if anyone in the future would rather just modify your original solution, here is one way to do it:

Temporary changes of global settings can safely be accomplished with a context manager.

from contextlib import contextmanager
from functools import cmp_to_key
import locale

@contextmanager
def changedlocale(newone):
    # setlocale() with no value only queries; its result can be passed back as-is
    old_locale = locale.setlocale(locale.LC_COLLATE)
    try:
        locale.setlocale(locale.LC_COLLATE, newone)
        yield locale.strcoll
    finally:
        locale.setlocale(locale.LC_COLLATE, old_locale)

def sort_strings(strings, locale_=None):
    if locale_ is None:
        return sorted(strings)

    with changedlocale(locale_) as strcoll:
        # cmp_to_key keeps this working on Python 3, which dropped cmp=
        return sorted(strings, key=cmp_to_key(strcoll))

This ensures a clean restoration of the original locale - as long as you don't use threading.

glglgl

Glibc does support a locale API with an explicit state. Here's a quick wrapper for that API made with ctypes.

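# -*- coding: utf-8 -*-
import ctypes


class Locale(object):
    def __init__(self, locale):
        LC_ALL_MASK = 8127
        # LC_COLLATE_MASK = 8
        self.ctx = None
        self.libc = ctypes.CDLL("libc.so.6")
        # locale_t is a pointer; without these declarations ctypes would
        # truncate it to a C int on 64-bit platforms
        self.libc.newlocale.restype = ctypes.c_void_p
        self.libc.newlocale.argtypes = [ctypes.c_int, ctypes.c_char_p, ctypes.c_void_p]
        self.libc.strxfrm_l.restype = ctypes.c_size_t
        self.libc.strxfrm_l.argtypes = [ctypes.c_char_p, ctypes.c_char_p,
                                        ctypes.c_size_t, ctypes.c_void_p]
        self.libc.freelocale.restype = None
        self.libc.freelocale.argtypes = [ctypes.c_void_p]
        if not isinstance(locale, bytes):
            locale = locale.encode('ascii')
        self.ctx = self.libc.newlocale(LC_ALL_MASK, locale, None)
        if not self.ctx:
            raise ValueError('unsupported locale: %r' % locale)

    def strxfrm(self, src, iteration=1):
        if not isinstance(src, bytes):
            src = src.encode('utf-8')  # only right for UTF-8 locales
        size = 3 * iteration * len(src) + 1  # +1 for the terminating NUL
        dest = ctypes.create_string_buffer(size)
        n = self.libc.strxfrm_l(dest, src, size, self.ctx)
        if n < size:
            return dest.value
        elif iteration <= 4:
            return self.strxfrm(src, iteration + 1)
        else:
            raise Exception('max number of iterations trying to increase dest reached')

    def __del__(self):
        # ctx stays None if newlocale() failed or was never reached
        if self.ctx:
            self.libc.freelocale(self.ctx)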

And a short test:

locale1 = Locale('C')
locale2 = Locale('mk_MK.UTF-8')

a_list = ['а', 'б', 'в', 'ј', 'ќ', 'џ', 'ш']
import random
random.shuffle(a_list)

assert sorted(a_list, key=locale1.strxfrm) == ['а', 'б', 'в', 'ш', 'ј', 'ќ', 'џ']
assert sorted(a_list, key=locale2.strxfrm) == ['а', 'б', 'в', 'ј', 'ќ', 'џ', 'ш']

What's left to do is to implement the remaining locale functions, add support for Python unicode strings (with the wchar* functions, I guess), and automatically import the definitions from the include files or something similar.
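
For the unicode part, one possible direction is to call wcscoll_l directly and turn it into a sort key with functools.cmp_to_key. This is only a rough sketch, assuming Python 3 and glibc: the LC_ALL_MASK value is the glibc-specific number used above, collation_key is a name invented for this example, and the locale object is deliberately never freed here.

```python
import ctypes
from functools import cmp_to_key

LC_ALL_MASK = 8127  # glibc-specific value (assumption, as in the wrapper)

_libc = ctypes.CDLL("libc.so.6")
# locale_t is a pointer, so declare it explicitly for ctypes
_libc.newlocale.restype = ctypes.c_void_p
_libc.newlocale.argtypes = [ctypes.c_int, ctypes.c_char_p, ctypes.c_void_p]
_libc.wcscoll_l.restype = ctypes.c_int
_libc.wcscoll_l.argtypes = [ctypes.c_wchar_p, ctypes.c_wchar_p, ctypes.c_void_p]


def collation_key(locale_name):
    """Return a key function that collates text strings in locale_name."""
    ctx = _libc.newlocale(LC_ALL_MASK, locale_name.encode("ascii"), None)
    if not ctx:
        raise ValueError("unsupported locale: %r" % locale_name)

    def compare(a, b):
        # Negative, zero or positive, like strcoll
        return _libc.wcscoll_l(a, b, ctx)

    # The locale object is intentionally leaked in this sketch.
    return cmp_to_key(compare)
```

Usage would look like sorted(a_list, key=collation_key('mk_MK.UTF-8')), provided that locale is installed on the system.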

gdamjan