How can I change how Python sort deals with punctuation?

Question

I'm currently trying to rewrite an R script in Python. I've been tripped up because it looks like R and Python sort some punctuation differently, specifically '&' and '_'. At some point in my program I sort by an identifier column in a Pandas dataframe.

As an example in Python:

t = ["1&2","1_2"]
sorted(t)

results in

['1&2', '1_2']

Comparatively in R:

t <- c("1&2","1_2")
sort(t)

results in

[1] "1_2" "1&2"

According to various resources (https://www.dconc.gov/home/showpublisheddocument/1481/635548543552170000) Python is doing the correct thing, but unfortunately I need to do the wrong thing here (changing R is not in scope).

Is there a straightforward way to change how Python sorts this? Specifically, I'll need to be able to do this on pandas dataframes when sorting by an ID column.

See the answer to [this question](https://stackoverflow.com/a/26579479/18571565), where you can define your own custom order for sorting. This might help. — Rawson, Jan 31 '23 at 17:54
Welcome to Stack Overflow. **What specifically is the order** that you want to implement? Saying "`_` should come before `&`" doesn't tell us anything about any other code point. As for Pandas, its sorting will have essentially the same interface. Are you familiar with how the `key` argument works *generally* for Python's sorting routines? Or else what exactly do you need to know about it? — Karl Knechtel, Jan 31 '23 at 18:01
@KarlKnechtel [Possible duplicate](https://stackoverflow.com/q/1097908/12671057), though I'm not sure it's applicable/appropriate in this specific case and not eager enough to find out, maybe you are. Found by googling [python sort collation](https://www.google.com/search?q=python+sort+collation). — Kelly Bundy, Jan 31 '23 at 23:12
How does R sort, exactly? I just realized all answers so far might actually be wrong, for example if R just moves "_" to before "&" in the order of all characters (instead of for example swapping those two). I don't know R much and you didn't invite the R folks (not entirely sure that would be appropriate, but I think it is). — Kelly Bundy, Feb 01 '23 at 00:47
It may actually be easier to change R in this regard, but this is probably information that is helpful either way: https://stackoverflow.com/questions/7229408/what-are-the-r-sorting-rules-of-character-vectors — juanpa.arrivillaga, Feb 01 '23 at 02:43
@KellyBundy I don't know that locale-based sorting would help, because I don't know that R's sort order is actually based on any particular locale. - actually, wait. juanpa's link implies that the behaviour is exactly due to the OP's locale setting. — Karl Knechtel, Feb 01 '23 at 09:25

Claudio · Accepted Answer · 2023-02-02T00:56:11.013

You can skip straight to FINALLY and use the code provided there to sort Python lists of strings the way R sorts them, or read the answer from top to bottom to learn a bit more about Python:

As Rawson already mentioned in a comment on your question (with a helpful link), you can define the sort order yourself for any characters you want to take out of the usual order:

t = ['1&2', '1_2']
print(sorted(t))

alphabet = {"_": -2, "&": -1}
def sortkey(word):
    # Look up each character's custom rank, falling back to its code point
    return [alphabet.get(c, ord(c)) for c in word]
    # which is equivalent to:
    # return [alphabet[c] if c in alphabet else ord(c) for c in word]

print(sortkey(t[0]), sortkey(t[1]))
print(sorted(t, key=sortkey))

gives:

['1&2', '1_2']
[49, -1, 50] [49, -2, 50]
['1_2', '1&2']

Use negative values for the redefined characters so that ord() can be used for every other character without colliding with them (advantage: this avoids possible problems with Unicode strings).
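As a small illustration of what the negative ranks do, the sketch below reuses the same `alphabet` and `sortkey` on a slightly larger list. The extra strings (`"1!2"`, `"1é2"`) are made up for the example. Note that this approach moves `_` and `&` in front of every other character rather than just swapping the two, which may or may not match what R does for your data.

```python
# Custom ranks for the redefined characters; everything else keeps ord()
alphabet = {"_": -2, "&": -1}

def sortkey(word):
    return [alphabet.get(c, ord(c)) for c in word]

# Hypothetical sample data, including a non-ASCII character
words = ["1é2", "1!2", "1&2", "1_2"]
result = sorted(words, key=sortkey)
print(result)  # ['1_2', '1&2', '1!2', '1é2']
```

Because "!" keeps its code point (33) while "_" and "&" get negative ranks, both redefined characters sort before it.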

If you want to redefine the order of many characters and only need the printable ones, you can also define your own alphabet string as follows:

#                                                                                v                     v
alphabet = """0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%_'()*+,-./:;<=>?@[\\]^&`{|}~"""

and then sort by it:

print(sorted(t, key=lambda s: [alphabet.index(c) for c in s]))

If you have a large amount of data to sort, consider turning the alphabet into a dictionary, since alphabet.index() does a linear search for every character:

dict_alphabet = { alphabet[i]:i for i in range(len(alphabet)) }
print(sorted(t, key=lambda s: [dict_alphabet[c] for c in s ]))

or, best of all, use the character translation feature that Python provides for strings:

alphabet = """0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%_'()*+,-./:;<=>?@[\\]^&`{|}~"""
table = str.maketrans(alphabet, ''.join(sorted(alphabet)))
print(sorted(t, key=lambda s: s.translate(table)))

By the way, you can get the printable ASCII characters from the string module:

import string
print(string.printable) # includes Form-Feed, Tab, VT, ...

FINALLY

Below is ready-to-use Python code for sorting lists of strings the way R sorts them (for the characters listed in RsortOrder, under the locale of the R session that produced it):

Rcode = """\
s <- "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!#$%&()*+,-./:;<=>?@[\\]^_`{|}~"
paste(sort(unlist(strsplit(s, ""))), collapse = "")"""
RsortOrder = "_-,;:!?.()[]{}@*/\\&#%`^+<=>|~$0123456789aAbBcCdDeEfFgGhHiIjJkKlLmMnNoOpPqQrRsStTuUvVwWxXyYzZ"
# ^--- result of running the R code online (TIO)
# print(''.join(sorted("_-,;:!?.()[]{}@*/\\&#%`^+<=>|~$0123456789aAbBcCdDeEfFgGhHiIjJkKlLmMnNoOpPqQrRsStTuUvVwWxXyYzZ")))
PythonSort = "!#$%&()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~"
# ===========================================
alphabet = RsortOrder
table = str.maketrans(alphabet, ''.join(sorted(alphabet)))
print(">>>",sorted(["1&2","1_2"], key=lambda s: s.translate(table)))

printing

>>> ['1_2', '1&2']
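Since the question asks about sorting a pandas dataframe by an ID column, here is a minimal sketch of how the same translation table might be applied there. It assumes pandas 1.1 or later (where `sort_values` accepts a `key` argument); the dataframe, its column names, and its values are hypothetical.

```python
import pandas as pd

# Character order reported by R in the answer above (locale-dependent;
# regenerate it for your own R setup if needed).
r_sort_order = "_-,;:!?.()[]{}@*/\\&#%`^+<=>|~$0123456789aAbBcCdDeEfFgGhHiIjJkKlLmMnNoOpPqQrRsStTuUvVwWxXyYzZ"
table = str.maketrans(r_sort_order, "".join(sorted(r_sort_order)))

# Hypothetical dataframe with an ID column
df = pd.DataFrame({"id": ["1&2", "1_2", "1a2"], "value": [10, 20, 30]})

# `key` receives the whole column as a Series and must return a Series
df_sorted = df.sort_values("id", key=lambda col: col.str.translate(table))
ids = df_sorted["id"].tolist()
print(ids)
```

The translated strings are only used for comparison; the values stored in the `id` column are left unchanged.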

Run the R code online using TIO, or generate your own RsortOrder by running the provided R code with your specific locale setting in R, as juanpa.arrivillaga suggested in the comments to your question.

Alternatively, you can use the Python locale module to sort with the same locale setting that R uses (see https://stackoverflow.com/questions/1097908/how-do-i-sort-unicode-strings-alphabetically-in-python ):

import locale
# this reads the environment and inits the right locale
locale.setlocale(locale.LC_ALL, "")
# locale.strxfrm(string)
# Transforms a string to one that can be used in locale-aware comparisons. 
# For example, strxfrm(s1) < strxfrm(s2) is equivalent to strcoll(s1, s2) < 0. 
# This function can be used when the same string is compared repeatedly, 
# e.g. when collating a sequence of strings.
print("###",sorted(["1&2","1_2"], key=locale.strxfrm))

prints

### ['1_2', '1&2']
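The result of the locale-based approach depends on which locales are installed and on the platform's C library, so it may not be reproducible everywhere. The sketch below is one way to request a specific locale explicitly and fall back to plain code point order when it is unavailable. The helper name `sort_with_locale` and the default locale name `en_US.UTF-8` are assumptions made for illustration.

```python
import locale

def sort_with_locale(strings, locale_name="en_US.UTF-8"):
    """Sort with the collation rules of locale_name, if that locale exists."""
    previous = locale.setlocale(locale.LC_COLLATE)  # query the current setting
    try:
        locale.setlocale(locale.LC_COLLATE, locale_name)
    except locale.Error:
        # The requested locale is not installed: fall back to code point order
        return sorted(strings)
    try:
        return sorted(strings, key=locale.strxfrm)
    finally:
        # Restore the collation setting that was active before the call
        locale.setlocale(locale.LC_COLLATE, previous)

result = sort_with_locale(["1&2", "1_2"])
print(result)  # order depends on the platform and installed locales
```

No expected output is shown on purpose, because the order of the two strings differs between systems.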

Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/251534/discussion-between-claudio-and-kelly-bundy). — Claudio, Feb 01 '23 at 00:50
do you happen to know what locale setting produces `['1_2', '1&2']`? — juanpa.arrivillaga, Feb 01 '23 at 03:45
I get `### ['1&2', '1_2']` for that last attempt: [demo](https://tio.run/##ZVFNa8MwDL33VwgHRgJZWLPLKNthDHbqsbcxipoojVkse7b79eszuWk7xk6Spaf3nmR3ir3lxyfnx1EbZ32EwTY40CyD2OsAnrANkhIQ77W3bIgjILegWcep4/W2v81NoQoUpyy/FJZv69flsgSlCuG@oqI/dt7kEjVvU2PlkUNnvQmAMJUhWrBMIoURGmTYEOwCJQcXnns8oCdorHHodbAcKhCud@uBjmjcQCXcpOYFPP@@6gJkS/re6b0QyWoiJs3GDoNAZaxO8Icz3yodpNtxE7XlP04OPfH5FAENXW0LeHIkAE@OMFI7nMrERNW2moaSEMYEl3XFBnFDYLsLR6hmTmLMVZZlqgzyP9TmH2p@V6tSzde1@izhi04vA5pNKwyLf5ctimI2jj8) — Kelly Bundy, Feb 01 '23 at 03:49
@KellyBundy probably because your default locale is not the one that gives you that order. Honestly, the OP shouldn't be just relying on their locale for sort order. Reading between the lines, the OP seems to want R and Python to work equivalently, and this will do it. But IMO the best solution would be to set the R locale to something appropriate in UTF-8 — juanpa.arrivillaga, Feb 01 '23 at 03:54
`print(locale.getdefaultlocale())` gives me: `('en_US', 'UTF-8')` and the output is `['1_2', '1&2']`. — Claudio, Feb 01 '23 at 04:22
@KellyBundy if you're using an online service to test the code, it probably doesn't implement any particular locale and will thus fall back on defaults. — Karl Knechtel, Feb 01 '23 at 09:28
That said, I can't reproduce the new sort order with either my default locale (`'en_CA.UTF-8'`) nor the reported one (`'en_US.UTF-8'`). — Karl Knechtel, Feb 01 '23 at 09:30
R uses the UCA for sorting, Python doesn't. Using locale for sorting is highly problematic and not a cross-platform solution. You will only get locale based sorting on Windows and on Linux systems using Glibc. On systems using bsd libc like macOS, FreeBSD, NetBSD, etc most collation tables are symlinked to single table and will not sort based on locale. Linux systems using musl libc and other varieties of libc do not have locale based collation available. — Andj, Feb 23 '23 at 18:51

Michael Cao · Answer 2 · 2023-01-31T20:06:12.640

Use a custom key for sorting. Here, we can just swap the & and _. We do the swap with a list comprehension that breaks the string into its individual characters and exchanges every & and _ along the way. Then we rebuild the string with ''.join.

t = ["1&2", "1_2", "5&3"]

def swap_chars(s):
    # Keep every character except '&' and '_', which trade places
    return ''.join([c if c not in ['&', '_']
                    else '_' if c == '&'
                    else '&'
                    for c in s])

sorted(t, key=swap_chars)
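The same swap can also be expressed with a translation table, which may be easier to read. This is only an alternative sketch of the idea above; the name `swap_table` is made up for the example.

```python
# Map '&' to '_' and '_' to '&'; all other characters are left untouched
swap_table = str.maketrans("&_", "_&")

def swap_chars(s):
    return s.translate(swap_table)

t = ["1&2", "1_2", "5&3"]
result = sorted(t, key=swap_chars)
print(result)  # ['1_2', '1&2', '5&3']

# Strings containing both characters are swapped character by character
print(swap_chars("1_&_2"))  # 1&_&2
```

As with the original function, this only exchanges the two characters and leaves the relative order of all other punctuation as Python defines it.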

edited Jan 31 '23 at 20:06

answered Jan 31 '23 at 17:59

What about `[ "1_&_2", "1&_&2" ]`? Just test what it gives ... You need another code for actual swapping chars in a string. – Claudio Jan 31 '23 at 19:25
@Claudio, I was operating under the assumption the input was going to be in the `\d(&|_)\d` format, but you make a fair point that it's not really a full-on swap function. I've edited in a more robust swap function. – Michael Cao Jan 31 '23 at 19:37
Please consider to change the Python module name `string` to for example `s` to avoid problems with hard to debug code because it is using Python keywords and/or function/module names for naming of variables and/or function parameter (same is valid for `dict, list, set, time`, etc.) – Claudio Jan 31 '23 at 19:53
Much easier to use the Unicode Collation Algorithm – Andj Feb 23 '23 at 14:22

Andj · Answer 3 · 2023-03-22T00:47:35.133

Actually, depending on which sort method you are using in R, Python and R use different collation algorithms. R's sort is based either on the Unicode Collation Algorithm (via ICU) or on a libc locale. Python's built-in sort compares strings by Unicode code point, and its locale module relies on libc. R in this instance is more flexible and can be compatible with other languages.

As others have noted, you could set LC_COLLATE to the C locale for both R and Python to get consistent results across languages.
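On the Python side, a minimal sketch of the C-locale suggestion might look like the following. The R call mentioned in the comment is the assumed counterpart and is not tested here; the sample list is made up.

```python
import locale

# Assumed R counterpart: Sys.setlocale("LC_COLLATE", "C")
locale.setlocale(locale.LC_COLLATE, "C")

t = ["1&2", "1_2", "a", "B"]
by_locale = sorted(t, key=locale.strxfrm)
print(by_locale)
```

In the C locale, collation follows code point order, so for these strings the result is the same as plain `sorted(t)`. In other words, this option makes R behave like Python rather than the other way around.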

Alternatively, if you have icu4c on your system, and PyICU installed, the following code illustrates the difference in sorting:

t = ["1&2","1_2"]
sorted(t)
# ['1&2', '1_2']

import icu
collator = icu.Collator.createInstance(icu.Locale.getRoot())
sorted(t, key=collator.getSortKey)
# ['1_2', '1&2']

The collator instance is using the root collation (i.e. the CLDR Collation Algorithm, a tailoring of the Unicode Collation Algorithm)

There are many differences between how R and Python sort. The obvious one is how upper and lower case are sorted. Using PyICU:

l = ['a', 'Z', 'A']
sorted(l)
# ['A', 'Z', 'a']
sorted(l, key=collator.getSortKey)
# ['a', 'A', 'Z']

In R:

l <- c("a", "Z", "A")
sort(l)
#[1] "a" "A" "Z"

Alternatively, it's possible to use DUCET (UCA) rather than CLDR's root collation; they give the same results in this instance.

from pyuca import Collator as ducetCollator
coll = ducetCollator()
sorted(t, key=coll.sort_key)
# ['1_2', '1&2']

Although, I would use an updated allkeys file for DUCET.
