Problem with Regular Expression in Python

Question

I have a Python function that returns a tuple to be used as the sort key for a natural ("human") sort.

I need to change it so that German umlauts are replaced by their base letters.

Long story short, I want to ignore Ä, Ö, Ü, ß when sorting.

Case should also be ignored: a lowercase d should have the same priority as a capital D.

For the umlauts I am chaining calls to str.replace, which seems a pretty awkward way to do it... :-/ I have no better idea. Any suggestions?

I also have not managed to rewrite this so that it is case-insensitive.

So far I have:

import re

def _human_key(key):
    # Replace umlauts and ß with plain ASCII letters.
    key = key.replace("Ä", "A").replace("Ö", "O").replace("Ü", "U")\
          .replace("ä", "a").replace("ö", "o").replace("ü", "u")\
          .replace("ß", "s")
    # Split into text and number parts; the numbers land at the odd indices.
    parts = re.split(r'(\d*\.\d+|\d+)', key)
    return tuple((e.swapcase() if i % 2 == 0 else float(e))
            for i, e in enumerate(parts))

Examples: I have the values

 Zabel
 Schneider
 anabel
 Arachno
 Öztürk
 de 'Hahn

which I want to sort; currently this puts:

anabel
de 'Hahn
Arachno
Öztürk
Schneider
Zabel

because the lowercase characters are treated with priority...

Expectation:

anabel
Arachno
de 'Hahn   ( <-- because "d" comes after "a")
Öztürk
Schneider
Zabel

I feel that replace is not the right way to solve the problem with the umlauts, but I can't find a better solution.

Update/Background information:

I am calling this from outside, from a subclass of "QSortFilterProxyModel"; I need it for sorting rows by the column that was clicked. I have a QTreeView which displays a result set from the database, and one column contains German family names. That's the background.

class HumanProxyModel(QtCore.QSortFilterProxyModel):
    def lessThan(self, source_left, source_right):
        data_left = source_left.data()
        data_right = source_right.data()
        if type(data_left) == type(data_right) == str:            
            return _human_key(data_left) < _human_key(data_right)            
        return super(HumanProxyModel, self).lessThan(source_left, source_right)

Why not go with something like https://stackoverflow.com/a/25057291/3820185 — wiesion, Mar 20 '19 at 12:42
Why do you want to replace unicode characters? They are characters after all. — Sven-Eric Krüger, Mar 20 '19 at 12:44
There is a difference between upper and lower case. You could avoid this by converting all keys to one case within the function. — Sven-Eric Krüger, Mar 20 '19 at 12:53
Are you sure what you are doing is right? You replace 'ä' with 'a' while it should be 'ae', and 'ö' with 'o' while it should be 'oe'. See "Printing conventions in German" at https://en.wikipedia.org/wiki/Diaeresis_(diacritic). — Dominique, Mar 20 '19 at 13:01
@Dominique: I am aware of that, thanks. But it would look strange if I sorted e.g. "Öztürk, Ottelo". Everybody would ask, "Hey, why is the 'z' before the 't'? That's wrong!" Technically it is because Ö -> Oe, which comes before Ot, but nobody understands that. So I want to get "Ottelo, Öztürk". — ProfP30, Mar 20 '19 at 13:14
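
The two suggestions from the comments (fold everything to one case, and map umlauts to the letters they should sort as) can be combined into one key function. The following is only a sketch: the names `human_key` and `_UMLAUT_TABLE` are made up for illustration, and mapping "ß" to "ss" is an assumption (the question maps it to a single "s"). It uses `str.translate` with a table from `str.maketrans` in place of the chained `replace` calls, and `str.casefold` in place of `swapcase`.

```python
import re

# Dictionary-style folding: each umlaut sorts like its base letter.
# To get the "ae"/"oe"/"ue" convention instead, change the values here.
_UMLAUT_TABLE = str.maketrans({
    "Ä": "A", "Ö": "O", "Ü": "U",
    "ä": "a", "ö": "o", "ü": "u",
    "ß": "ss",  # assumption; the question uses a single "s"
})

_NUMBER_RE = re.compile(r"(\d*\.\d+|\d+)")


def human_key(text):
    """Case-insensitive natural-sort key with umlauts folded away."""
    folded = text.translate(_UMLAUT_TABLE).casefold()
    # re.split with a capturing group puts the numbers at the odd indices.
    parts = _NUMBER_RE.split(folded)
    return tuple(float(p) if i % 2 else p for i, p in enumerate(parts))


names = ["Zabel", "Schneider", "anabel", "Arachno", "Öztürk", "de 'Hahn"]
print(sorted(names, key=human_key))
```

Because the key does not depend on the locale, the result should be the same on every machine. Such a function could be dropped into `lessThan` in place of `_human_key`.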

Nqsir · Answer 1 · 2019-03-20T16:54:32.223

Does that help?

import locale
locale.setlocale(locale.LC_ALL, "")

lst = ['Zabel', 'Schneider', 'anabel', 'Arachno', 'Öztürk', 'de Hahn']

print(sorted(lst, key=locale.strxfrm))

This gave me:

['anabel', 'Arachno', 'de Hahn', 'Öztürk', 'Schneider', 'Zabel']

For further reading, I looked at: http://code.activestate.com/recipes/576507-sort-strings-containing-german-umlauts-in-correct-/

UPDATE

OK, so if you want to keep your method and get rid of the umlauts, you can do something like this. There are plenty of better ways to do it, but it's a start:

import locale
locale.setlocale(locale.LC_ALL, "")

lst = ['Zabel', 'Schneider', 'anabel', 'Arachno', 'Öztürk', 'de Hahn']

def _human_key(your_list):
    your_list.sort(key=locale.strxfrm)
    res = []
    for item in your_list:
        word = item.replace("Ä", "A").replace("Ö", "O").replace("Ü", "U")\
              .replace("ä", "a").replace("ö", "o").replace("ü", "u")\
              .replace("ß", "s")
        res.append(word)
    return res

print(_human_key(lst))

This gave me:

['anabel', 'Arachno', 'de Hahn', 'Ozturk', 'Schneider', 'Zabel']

No offense meant, but regex doesn't seem to be an appropriate tag or approach for your problem if you could not work the previous code into your method. Hope it helped.

edited Mar 20 '19 at 16:54

answered Mar 20 '19 at 14:04


Looks good, but how do I work this into the _human_key() function? I have no idea... :-/ – ProfP30 Mar 20 '19 at 15:43
@ProfP30 You do not need the function `_human_key` anymore. You can use `lst.sort(key=locale.strxfrm)`... @Nqsir When I use this solution, "Öztürk" comes before "de Hahn". – Sven-Eric Krüger Mar 20 '19 at 16:11
@Sven Krüger, which version of Python do you have? I have 3.7.2. – Nqsir Mar 20 '19 at 17:09
I have 3.7.1, but I am afraid that with `locale.setlocale(locale.LC_ALL, "")` the client's locale settings come into effect, so you could get different results depending on the machine, which is what I would like to avoid. With the `key=locale.strxfrm` solution I also get undesired results. – ProfP30 Mar 20 '19 at 17:13
Please see my update in the background information section. – ProfP30 Mar 20 '19 at 17:25
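
One way to reduce the machine dependence raised in the comments above is to request a specific collation locale instead of passing "" (which picks up the user's settings). This is only a sketch: the helper `german_sort_key` is hypothetical, the locale names listed are assumptions that vary by operating system, and the collation rules themselves still come from the platform's C library, so results can differ between systems even when the same name is accepted.

```python
import locale

# Candidate names for a German collation locale. Which of these exist
# depends on the operating system, so they are assumptions.
GERMAN_LOCALES = ("de_DE.UTF-8", "de_DE.utf8", "German_Germany.1252")


def german_sort_key():
    """Return (key_function, locale_name); locale_name is None on fallback."""
    for name in GERMAN_LOCALES:
        try:
            # Only change collation, not number or date formatting.
            locale.setlocale(locale.LC_COLLATE, name)
        except locale.Error:
            continue  # not installed on this machine, try the next name
        return locale.strxfrm, name
    # No German locale available: fall back to a case-insensitive key.
    return str.casefold, None


key, active_locale = german_sort_key()
names = ["Zabel", "Schneider", "anabel", "Arachno"]
print(active_locale, sorted(names, key=key))
```

Note that `locale.setlocale` changes the setting for the whole process, so it is best called once at startup rather than inside `lessThan`.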

SethMMorton · Accepted Answer · 2019-03-21T06:36:15.540

If you don't mind using third-party modules, you can use natsort (full disclosure, I am the author). For the data you give, it returns what you want out-of-the-box.

>>> from natsort import natsorted, ns
>>> data = ['Zabel', 'Schneider', 'anabel', 'Arachno', 'Öztürk', 'de Hahn']
>>> natsorted(data, alg=ns.LOCALE)  # ns.LOCALE turns on locale-aware handling
['anabel', 'Arachno', 'de Hahn', 'Öztürk', 'Schneider', 'Zabel']
>>> from natsort import humansorted
>>> humansorted(data)  # shortcut for using LOCALE
['anabel', 'Arachno', 'de Hahn', 'Öztürk', 'Schneider', 'Zabel']

If you need a sorting key, you can use natsort's key-generator:

>>> from natsort import natsort_keygen, ns
>>> humansort_key = natsort_keygen(alg=ns.LOCALE)
>>> humansort_key(this) < humansort_key(that)

Note that you don't necessarily need to use locale... you just need to properly normalize the Unicode, which natsort automatically does under the hood. In your case, it looks like you want both capital and lowercase letters grouped together with the lowercase first, so you could use this instead:

>>> natsorted(data, alg=ns.GROUPLETTERS | ns.LOWERCASEFIRST)  # or ns.G | ns.LF
['anabel', 'Arachno', 'de Hahn', 'Öztürk', 'Schneider', 'Zabel']

I suggest this because trying to deal with locale is a nightmare, and if it is not needed then you are much better off.
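
To illustrate what normalizing Unicode can look like without any locale, here is a standard-library sketch. It is not natsort's implementation; `strip_accents` and `simple_key` are hypothetical names, and dropping the combining marks after decomposition goes one step further than normalization alone.

```python
import unicodedata


def strip_accents(text):
    # NFKD decomposes "Ö" into "O" followed by a combining diaeresis (U+0308).
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop the combining marks and keep only the base characters.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def simple_key(text):
    # casefold() removes case differences and turns "ß" into "ss".
    return strip_accents(text).casefold()


data = ['Zabel', 'Schneider', 'anabel', 'Arachno', 'Öztürk', 'de Hahn']
print(sorted(data, key=simple_key))
# ['anabel', 'Arachno', 'de Hahn', 'Öztürk', 'Schneider', 'Zabel']
```

Unlike the natsort calls above, this key does not treat embedded numbers specially and does not put lowercase before uppercase when two names are otherwise equal.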

Thank you, this seems to do the job how I want it to:

from natsort import natsort_keygen, ns

humansort_key = natsort_keygen(alg=ns.LOCALE)

class HumanProxyModel(QtCore.QSortFilterProxyModel):
    def lessThan(self, source_left, source_right):
        data_left = source_left.data()
        data_right = source_right.data()
        if type(data_left) == type(data_right) == str:
            return humansort_key(data_left) < humansort_key(data_right)
        return super(HumanProxyModel, self).lessThan(source_left, source_right)

— ProfP30, Mar 26 '19 at 10:13
