How are strings compared?

Question

I'm wondering how Python does string comparison, more specifically how it determines the outcome when a less than < or greater than > operator is used.

For instance if I put print('abc' < 'bac') I get True. I understand that it compares corresponding characters in the string, however its unclear as to why there is more, for lack of a better term, "weight" placed on the fact that a is less thanb (first position) in first string rather than the fact that a is less than b in the second string (second position).

_{Many people ask this question when the strings contain representations of numbers, and want to compare the numbers by numeric value. The straightforward solution is to convert the values first. See How do I parse a string to a float or int? . If there are multiple numbers in a list or other collection, see How can I collect the results of a repeated calculation in a list, dictionary etc. (or make a copy of a list with each element modified)? for batch conversion.}

_{If you are trying to compare strings that contain digit sequences, treating the digits as if they were numeric (sometimes called "natural sort"), see Is there a built in function for string natural sort? .}

What? How else can ordering be defined other than left-to-right? — S.Lott, Jan 26 '11 at 16:21
@S.Lott: right-to-left. Not that anyone would do so, but it's not the only possibility. — Katriel, Jan 26 '11 at 16:22
@katrielalex: If you allow that, you'd have to allow random and even-only and odd-only and every other possibility. Then you'd have to "parameterize" the operator to pick which ordering. If there's going to be a default, how could it be other than left-to-right? — S.Lott, Jan 26 '11 at 16:27
@S.Lott: I agree -- lex is the only sensible order to use. I just nitpicked that it's certainly not the only *possible* order! — Katriel, Jan 26 '11 at 16:30
@S.Lott: To answer your question, you might use `sorted(range(10), key=lambda i: i ^ 123)` for numbers or `sorted('How else can ordering be defined other than left-to-right?'.split(), key= lambda s: s[::-1])` for text. They are definite (if unhelpful) orderings. — Noctis Skytower, Nov 15 '12 at 18:38
"why there is more, for lack of a better term, "weight" placed on the fact that a is less thanb (first position) in first string rather than the fact that a is less than b in the second string (second position)." For the same reason that there is more "weight" placed on the fact that 3 > 2, than on 1 < 9, when comparing `31 > 29`. — Karl Knechtel, Feb 13 '23 at 20:50

score 123 · Accepted Answer · edited Feb 13 '23 at 04:03

From the docs:

The comparison uses lexicographical ordering: first the first two items are compared, and if they differ this determines the outcome of the comparison; if they are equal, the next two items are compared, and so on, until either sequence is exhausted.

Also:

Lexicographical ordering for strings uses the Unicode code point number to order individual characters.

or on Python 2:

Lexicographical ordering for strings uses the ASCII ordering for individual characters.

As an example:

>>> 'abc' > 'bac'
False
>>> ord('a'), ord('b')
(97, 98)

The result False is returned as soon as a is found to be less than b. The further items are not compared (as you can see for the second items: b > a is True).

Be aware of lower and uppercase:

>>> [(x, ord(x)) for x in abc]
[('a', 97), ('b', 98), ('c', 99), ('d', 100), ('e', 101), ('f', 102), ('g', 103), ('h', 104), ('i', 105), ('j', 106), ('k', 107), ('l', 108), ('m', 109), ('n', 110), ('o', 111), ('p', 112), ('q', 113), ('r', 114), ('s', 115), ('t', 116), ('u', 117), ('v', 118), ('w', 119), ('x', 120), ('y', 121), ('z', 122)]
>>> [(x, ord(x)) for x in abc.upper()]
[('A', 65), ('B', 66), ('C', 67), ('D', 68), ('E', 69), ('F', 70), ('G', 71), ('H', 72), ('I', 73), ('J', 74), ('K', 75), ('L', 76), ('M', 77), ('N', 78), ('O', 79), ('P', 80), ('Q', 81), ('R', 82), ('S', 83), ('T', 84), ('U', 85), ('V', 86), ('W', 87), ('X', 88), ('Y', 89), ('Z', 90)]

Specifically, this has the consequence of 'a' > 'A', 'b' > 'B', etc. including 'a' > 'Z' all evaluate to True as all lowercase characters from a to z have a higher code point number than all uppercase characters.

Just wanted to add that if one sequence is exhausted, that sequence is less: `'abc' < 'abcd'`. — Noumenon, May 06 '16 at 07:22
Thank you for this, might be useful to add that it works for number strings too. I was just having this issue `"24" > 40` = `True` due to `ord("2")` = `50` — rustysys-dev, Feb 21 '18 at 03:59
@vaultah: Just to save other people reading your comment the need to read the question you're linking to, the relevant rule for Python 2 is "When you order a numeric and a non-numeric type, the numeric type comes first." (Python 3 raises a TypeError exception instead, btw.) — Rörd, May 20 '19 at 12:06

score 13 · Answer 2 · answered Jan 26 '11 at 16:20

13

Python string comparison is lexicographic:

From Python Docs: http://docs.python.org/reference/expressions.html

Strings are compared lexicographically using the numeric equivalents (the result of the built-in function ord()) of their characters. Unicode and 8-bit strings are fully interoperable in this behavior.

Hence in your example, 'abc' < 'bac', 'a' comes before (less-than) 'b' numerically (in ASCII and Unicode representations), so the comparison ends right there.

answered Jan 26 '11 at 16:20

wkl

77,184
16
165
176

So, does it end the comparison as soon as it finds that one of the characters is less than the one it corresponds with? – davelupt Jan 26 '11 at 16:33
@David: Yes. Either less than or greater than. If they are equal, the next items are compared. – user225312 Jan 26 '11 at 16:37

score 8 · Answer 3 · answered Jan 26 '11 at 19:52

Python and just about every other computer language use the same principles as (I hope) you would use when finding a word in a printed dictionary:

(1) Depending on the human language involved, you have a notion of character ordering: 'a' < 'b' < 'c' etc

(2) First character has more weight than second character: 'az' < 'za' (whether the language is written left-to-right or right-to-left or boustrophedon is quite irrelevant)

(3) If you run out of characters to test, the shorter string is less than the longer string: 'foo' < 'food'

Typically, in a computer language the "notion of character ordering" is rather primitive: each character has a human-language-independent number ord(character) and characters are compared and sorted using that number. Often that ordering is not appropriate to the human language of the user, and then you need to get into "collating", a fun topic.

score 6 · Answer 4 · edited May 23 '17 at 12:32

Take a look also at How do I sort unicode strings alphabetically in Python? where the discussion is about sorting rules given by the Unicode Collation Algorithm (http://www.unicode.org/reports/tr10/).

To reply to the comment

What? How else can ordering be defined other than left-to-right?

by S.Lott, there is a famous counter-example when sorting French language. It involves accents: indeed, one could say that, in French, letters are sorted left-to-right and accents right-to-left. Here is the counter-example: we have e < é and o < ô, so you would expect the words cote, coté, côte, côté to be sorted as cote < coté < côte < côté. Well, this is not what happens, in fact you have: cote < côte < coté < côté, i.e., if we remove "c" and "t", we get oe < ôe < oé < ôé, which is exactly right-to-left ordering.

And a last remark: you shouldn't be talking about left-to-right and right-to-left sorting but rather about forward and backward sorting.

Indeed there are languages written from right to left and if you think Arabic and Hebrew are sorted right-to-left you may be right from a graphical point of view, but you are wrong on the logical level!

Indeed, Unicode considers character strings encoded in logical order, and writing direction is a phenomenon occurring on the glyph level. In other words, even if in the word שלום the letter shin appears on the right of the lamed, logically it occurs before it. To sort this word one will first consider the shin, then the lamed, then the vav, then the mem, and this is forward ordering (although Hebrew is written right-to-left), while French accents are sorted backwards (although French is written left-to-right).

MSeifert · Answer 5 · 2022-10-21T10:48:19.603

4

A pure Python equivalent for string comparisons would be:

def less(string1, string2):
    # Compare character by character
    for idx in range(min(len(string1), len(string2))):
        # Get the "value" of the character
        ordinal1, ordinal2 = ord(string1[idx]), ord(string2[idx])
        # If the "value" is identical check the next characters
        if ordinal1 == ordinal2:
            continue
        # It's not equal so we're finished at this index and can evaluate which is smaller.
        else:
            return ordinal1 < ordinal2
    # We're out of characters and all were equal, so the result depends on the length
    # of the strings.
    return len(string1) < len(string2)

This function does the equivalent of the real method (Python 3.6 and Python 2.7) just a lot slower. Also note that the implementation isn't exactly "pythonic" and only works for < comparisons. It's just to illustrate how it works. I haven't checked if it works like Pythons comparison for combined unicode characters.

A more general variant would be:

from operator import lt, gt

def compare(string1, string2, less=True):
    op = lt if less else gt
    for char1, char2 in zip(string1, string2):
        ordinal1, ordinal2 = ord(char1), ord(char2)
        if ordinal1 == ordinal2:
            continue
        else:
            return op(ordinal1, ordinal2)
    return op(len(string1), len(string2))

edited Oct 21 '22 at 10:48

answered Sep 10 '17 at 22:23

MSeifert

145,886
38
333
352

In both cases your loops terminate at the end of whichever string is shortest. You cannot then just return `False` unconditionally, that is wrong if `string1` is longer than `string2` (e.g. `doggy` & `dog` ), you need to check.... – JonBrave Feb 26 '19 at 13:05
@JonBrave Seems correct what you say. You mean adding a `if len(string1) < len(string2): return True` before the final `return False`? I'm not at a computer currently, so I cannot check. Will do later :) – MSeifert Feb 26 '19 at 13:40
Yes, you need some test at the end deciding whether to return `False` or `True` according as either you have reached the end of both strings (`False`, because they are equal), `string1` is longer (also `False`) or `string2` is longer (`True`). The whole thing could be coded as `return len(string1) < len(string2)`. – JonBrave Feb 27 '19 at 09:04
Why not just `else: return op(ordinal1, ordinal2)` instead of the `elif op(ordinal1, ordinal2): return True else: return False`? – Géry Ogam Sep 28 '20 at 14:38
@Maggyero It was mostly because the equivalent C implementation had an `else if` but your suggestion is equivalent, shorter, and also in my opinion better style. Thanks for pointing it out. – MSeifert Sep 28 '20 at 17:07
You are welcome! The elegant `zip` can also be used in the first snippet I guess. – Géry Ogam Sep 28 '20 at 17:09
There is a tiny bug: `ordinal1, ordinal2 = ord(char1), ord(char1)` should be `ordinal1, ordinal2 = ord(char1), ord(char2)` => `ord(char1)` -> `ord(char2)` – Taher A. Ghaleb Oct 19 '22 at 03:29

score 4 · Answer 6 · answered Jan 26 '11 at 16:20

4

This is a lexicographical ordering. It just puts things in dictionary order.

answered Jan 26 '11 at 16:20

Michael J. Barber

24,518
9
68
88

This is actually wrong, because a dictionary doesn't make a difference between lowercase and uppercase letters, for instance `'a' > 'z'` is `True` while `'a' > 'Z'`is `False` – seb Jul 10 '17 at 16:57

score 2 · Answer 7 · answered Jan 26 '11 at 16:22

2

Strings are compared lexicographically using the numeric equivalents (the result of the built-in function ord()) of their characters. Unicode and 8-bit strings are fully interoperable in this behavior.

answered Jan 26 '11 at 16:22

Senthil Kumaran

54,681
14
94
131

How are strings compared?

7 Answers7

Linked

Related