99

Is there an elegant way to have performant, natural sorting in a MySQL database?

For example if I have this data set:

  • Final Fantasy
  • Final Fantasy 4
  • Final Fantasy 10
  • Final Fantasy 12
  • Final Fantasy 12: Chains of Promathia
  • Final Fantasy Adventure
  • Final Fantasy Origins
  • Final Fantasy Tactics

Is there any elegant solution other than splitting the games' names into their components

  • Title: "Final Fantasy"
  • Number: "12"
  • Subtitle: "Chains of Promathia"

to make sure that they come out in the right order? (10 after 4, not before 2).

Doing so is a pain in the a** because every now and then there's another game that breaks that mechanism of parsing the game title (e.g. "Warhammer 40,000", "James Bond 007")

BlaM
  • 31
    Chains of Promathia is related to 11. – Flame Feb 09 '09 at 22:47
  • Possible duplicate of [MySQL 'Order By' - sorting alphanumeric correctly](https://stackoverflow.com/questions/8557172/mysql-order-by-sorting-alphanumeric-correctly) – Christian Nov 28 '17 at 17:33
  • Related: https://stackoverflow.com/questions/48600059/using-mysql-sort-varchar-column-numerically-with-cast-as-unsigned-when-the-colum – Paul Spiegel Feb 18 '20 at 14:11

22 Answers

98

Here is a quick solution:

SELECT alphanumeric,
       `integer`
FROM sorting_test
ORDER BY LENGTH(alphanumeric), alphanumeric;
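
To see why this ordering works for some data sets and not others, here is a minimal Python sketch that mimics it outside the database. The function name `length_then_alpha` and the sample titles are illustrative assumptions, and Python's `len` counts characters where MySQL's `LENGTH()` counts bytes, which only matters for non-ASCII text.

```python
def length_then_alpha(titles):
    """Mimic ORDER BY LENGTH(col), col: shortest first, ties broken alphabetically."""
    return sorted(titles, key=lambda t: (len(t), t))


# Same prefix everywhere: a longer string means a bigger number, so this works.
same_prefix = ["Final Fantasy 12", "Final Fantasy 4", "Final Fantasy 10"]
print(length_then_alpha(same_prefix))
# ['Final Fantasy 4', 'Final Fantasy 10', 'Final Fantasy 12']

# Mixed prefixes: length dominates, so a short unrelated title jumps ahead.
mixed = ["Final Fantasy 4", "Goofy", "Final Fantasy 10"]
print(length_then_alpha(mixed))
# ['Goofy', 'Final Fantasy 4', 'Final Fantasy 10']
```

The second call shows the limitation: the trick only holds when every row shares the same non-numeric prefix.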
C B
slotishtype
  • 54
    That's nice if everything is "Final Fantasy", but it puts "Goofy" ahead of the FF suite. – fortboise Dec 07 '11 at 19:22
  • 4
    This solution does not work all the time. It breaks sometimes. You should rather use this one: http://stackoverflow.com/a/12257917/384864 – Borut Tomazin Jan 27 '13 at 17:38
  • 8
    Piling kludge upon kludge: `SELECT alphanumeric, integer FROM sorting_test ORDER BY SOUNDEX(alphanumeric), LENGTH(alphanumeric), alphanumeric`. If this works at all, it's because SOUNDEX conveniently discards the numbers, thus ensuring that e.g. `apple1` comes before `z1`. – offby1 Oct 09 '14 at 20:18
  • great solution, thanks, though I had to switch the order to `alphanumeric`, `length(alphanumeric)` to avoid "Goofy" before "Final Fantasy" – Asped Oct 29 '14 at 16:26
  • 1
    @offby1's suggestion only works if the text is written entirely in English, as `SOUNDEX()` is designed to work correctly only on English words. – Raymond Nijland Aug 16 '19 at 16:37
  • That's sad we should resort to such tricks, nice trick anyway. – YSC Jan 07 '21 at 13:22
  • This is a great and very simple answer to natural sorting numbers. Thank you. – dresende Feb 05 '21 at 15:04
62

Just found this:

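SELECT games FROM your_table ORDER BY games + 0 ASC

This does a natural sort when the numbers are at the front of the string. It does not help when the number is in the middle: a string with no leading digits converts to 0 in a numeric context.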
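
The behaviour can be approximated in Python to see where it helps and where it does not. This is a rough sketch: `leading_number` is a hypothetical helper that only handles unsigned integers at the very start of the string, while MySQL's own string-to-number conversion accepts more forms (signs and decimals, for example).

```python
import re


def leading_number(value):
    """Approximate `value + 0` in MySQL: leading digits as an int, else 0."""
    match = re.match(r"[0-9]+", value)
    return int(match.group(0)) if match else 0


addresses = ["10 Downing Street", "9 Elm Road", "100 Main Street"]
print(sorted(addresses, key=leading_number))
# ['9 Elm Road', '10 Downing Street', '100 Main Street']

# No leading digits: every title gets the same key, 0, so the key cannot order them.
print([leading_number(t) for t in ["Final Fantasy 10", "Final Fantasy 4"]])
# [0, 0]
```

When every row maps to the same key, the database is free to return the rows in any order unless a second ORDER BY column breaks the tie.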

  • 3
    I have not tried it, but I seriously doubt it. The reason it works with the number at the front is because `games` is used as in a numeric context and thus converted to a number before comparison. If in the middle it will always convert to 0 and the sorting will become pseudo-random. – manixrock May 03 '11 at 13:59
  • 1
    This is not natural sort. Rather take a look at this working solution: http://stackoverflow.com/a/12257917/384864 – Borut Tomazin Jan 27 '13 at 17:39
  • @fedir This worked well for me too. I'm not even entirely sure exactly why this works. Any chance of an explanation markletp? – BizNuge Apr 04 '14 at 13:12
  • Just had a quick investigation of this and I get it. I didn't even realise MySQL would do this sort of casting just by using a mathematical operator on a string! Cool thing is that it just returns zero when there is no integer at the front of the string to "cast". Thanks for this! ---> SELECT ADDRESS, (ADDRESS * 1) as _cast FROM premises WHERE POSTCODE LIKE 'NE1%' ORDER BY ADDRESS * 1 ASC, ADDRESS LIMIT 100000; – BizNuge Apr 04 '14 at 13:34
  • I used this method in my Magento store to sort categories naturally. The code looked like this: `addAttributeToSort('name', ASC)` and now it looks like this: `addAttributeToSort('name + 0', ASC)`. Now when I have categories that begin with 1 or 2 digit numbers, they're naturally sorted. Brilliant answer, thanks for sharing! – NotJay Dec 29 '15 at 15:55
  • **Edit**: However, I should note that I needed this for one category only so I implemented an if statement for that category id. Otherwise, the text-only categories would be out of order. – NotJay Dec 29 '15 at 16:09
  • 2
    This does not actually work when the numbers are in the middle such as "Final Fantasy 100" or "Final Fantasy 2". "Final Fantasy 100" will show first. It does however work when the integer is first "100 Final Fantasy" – dwenaus Mar 02 '16 at 21:54
  • THIS IS VERY DANGEROUS! On my query it worked fine, I upvoted the answer BUT when I refreshed, it didn't work! Then I go ahead and refresh the query 100 times, randomly it works and doesn't work for the SAME query! Don't rely on this! My table has a number at the end and here is my query: SELECT TABLE_NAME FROM information_schema.TABLES WHERE TABLE_SCHEMA = 'my_database' AND TABLE_NAME LIKE '%my_table%' ORDER BY TABLE_NAME+0 DESC LIMIT 1 – Tarik Aug 25 '16 at 15:45
  • OMG, best solution ever! You have saved me there, man! I was looking for a simple way to sort house numbers, which could have a lot of weird stuff after the integer, and I had to do it with the oldest both mysql and php. – Vitalij Apr 16 '17 at 12:16
58

Same function as posted by @plalx, but rewritten to MySQL:

DROP FUNCTION IF EXISTS `udf_FirstNumberPos`;
DELIMITER ;;
CREATE FUNCTION `udf_FirstNumberPos` (`instring` varchar(4000)) 
RETURNS int
LANGUAGE SQL
DETERMINISTIC
NO SQL
SQL SECURITY INVOKER
BEGIN
    DECLARE position int;
    DECLARE tmp_position int;
    SET position = 5000;
    SET tmp_position = LOCATE('0', instring); IF (tmp_position > 0 AND tmp_position < position) THEN SET position = tmp_position; END IF; 
    SET tmp_position = LOCATE('1', instring); IF (tmp_position > 0 AND tmp_position < position) THEN SET position = tmp_position; END IF;
    SET tmp_position = LOCATE('2', instring); IF (tmp_position > 0 AND tmp_position < position) THEN SET position = tmp_position; END IF;
    SET tmp_position = LOCATE('3', instring); IF (tmp_position > 0 AND tmp_position < position) THEN SET position = tmp_position; END IF;
    SET tmp_position = LOCATE('4', instring); IF (tmp_position > 0 AND tmp_position < position) THEN SET position = tmp_position; END IF;
    SET tmp_position = LOCATE('5', instring); IF (tmp_position > 0 AND tmp_position < position) THEN SET position = tmp_position; END IF;
    SET tmp_position = LOCATE('6', instring); IF (tmp_position > 0 AND tmp_position < position) THEN SET position = tmp_position; END IF;
    SET tmp_position = LOCATE('7', instring); IF (tmp_position > 0 AND tmp_position < position) THEN SET position = tmp_position; END IF;
    SET tmp_position = LOCATE('8', instring); IF (tmp_position > 0 AND tmp_position < position) THEN SET position = tmp_position; END IF;
    SET tmp_position = LOCATE('9', instring); IF (tmp_position > 0 AND tmp_position < position) THEN SET position = tmp_position; END IF;

    IF (position = 5000) THEN RETURN 0; END IF;
    RETURN position;
END
;;

DROP FUNCTION IF EXISTS `udf_NaturalSortFormat`;
DELIMITER ;;
CREATE FUNCTION `udf_NaturalSortFormat` (`instring` varchar(4000), `numberLength` int, `sameOrderChars` char(50)) 
RETURNS varchar(4000)
LANGUAGE SQL
DETERMINISTIC
NO SQL
SQL SECURITY INVOKER
BEGIN
    DECLARE sortString varchar(4000);
    DECLARE numStartIndex int;
    DECLARE numEndIndex int;
    DECLARE padLength int;
    DECLARE totalPadLength int;
    DECLARE i int;
    DECLARE sameOrderCharsLen int;

    SET totalPadLength = 0;
    SET instring = TRIM(instring);
    SET sortString = instring;
    SET numStartIndex = udf_FirstNumberPos(instring);
    SET numEndIndex = 0;
    SET i = 1;
    SET sameOrderCharsLen = CHAR_LENGTH(sameOrderChars);

    WHILE (i <= sameOrderCharsLen) DO
        SET sortString = REPLACE(sortString, SUBSTRING(sameOrderChars, i, 1), ' ');
        SET i = i + 1;
    END WHILE;

    WHILE (numStartIndex <> 0) DO
        SET numStartIndex = numStartIndex + numEndIndex;
        SET numEndIndex = numStartIndex;

        WHILE (udf_FirstNumberPos(SUBSTRING(instring, numEndIndex, 1)) = 1) DO
            SET numEndIndex = numEndIndex + 1;
        END WHILE;

        SET numEndIndex = numEndIndex - 1;

        SET padLength = numberLength - (numEndIndex + 1 - numStartIndex);

        IF padLength < 0 THEN
            SET padLength = 0;
        END IF;

        SET sortString = INSERT(sortString, numStartIndex + totalPadLength, 0, REPEAT('0', padLength));

        SET totalPadLength = totalPadLength + padLength;
        SET numStartIndex = udf_FirstNumberPos(RIGHT(instring, CHAR_LENGTH(instring) - numEndIndex));
    END WHILE;

    RETURN sortString;
END
;;
DELIMITER ;

Usage:

SELECT name FROM products ORDER BY udf_NaturalSortFormat(name, 10, ".")
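
For readers who want to check what the sort key looks like, here is a minimal Python sketch of the same idea: replace the "same order" characters with spaces, then left-pad every run of digits with zeros to a fixed width. It is an approximation written for illustration, not a line-by-line port of the stored function, and the Python names are assumptions.

```python
import re


def natural_sort_format(value, number_length=10, same_order_chars=""):
    """Build a sort key: trim, blank out same-order chars, zero-pad digit runs."""
    key = value.strip()
    for ch in same_order_chars:
        key = key.replace(ch, " ")
    # zfill leaves runs that are already longer than number_length untouched.
    return re.sub(r"[0-9]+", lambda m: m.group(0).zfill(number_length), key)


print(natural_sort_format("R2", 3))
# R002

names = ["R11", "A1-1.", "R2", "A1.", "R1"]
print(sorted(names, key=lambda n: natural_sort_format(n, 10, ".-")))
# ['A1.', 'A1-1.', 'R1', 'R2', 'R11']
```

Numbers longer than `number_length` digits are not padded, so the width has to be at least as large as the longest number in the data.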
Richard Toth
  • 6
    This is the only solution that really works. I've also tested Drupal's code but it fails sometimes. Thanks man! – Borut Tomazin Jan 27 '13 at 17:36
  • Anyone use this on really big tables 10+ million? – Mark Steudel Jan 31 '17 at 23:58
  • 6
    @MarkSteudel We use a function similar to this (though not this exact one) for natural sorting on several tables, the largest of which is ~5 million rows. However, we don't call it directly in our queries but instead use it to set the value of a `nat_name` column. We use a trigger to run the function every time a row is updated. This approach gives you natural sorting with no real performance cost at the expense of an additional column. – Jacob Jan 18 '19 at 22:39
  • this works, sorting numbers before letters, and can be implemented in Drupal using hook_views_query_alter, using something similar to this `if ($query->orderby[0]["field"] === "node_field_data.title") { $orderBySql = " udf_NaturalSortFormat(node_field_data.title, 10, '.') "; $query->orderby = []; $query->addOrderBy(NULL, $orderBySql, $query->orderby[0]["direction"], 'title_natural'); array_unshift($query->orderby, end($query->orderby)); }` – realgt Jul 22 '19 at 16:43
  • 1
    Here's a version for SQL Server, if anyone is interested: https://gist.github.com/Xriuk/93288e4527dbe0ce1936e748dfdfb7df – Xriuk Jun 17 '22 at 08:21
23

I think this is why a lot of things are sorted by release date.

A solution could be to create another column in your table for the "SortKey". This could be a sanitized version of the title which conforms to a pattern you create for easy sorting or a counter.

Michael Haren
18

I wrote this function for MSSQL 2000 a while ago:

/**
 * Returns a string formatted for natural sorting. This function is very useful when having to sort alphanumeric strings.
 *
 * @author Alexandre Potvin Latreille (plalx)
 * @param {nvarchar(4000)} string The string to format.
 * @param {int} numberLength The length each number should have (including padding). This should be the length of the longest number. Defaults to 10.
 * @param {char(50)} sameOrderChars A list of characters that should have the same order. Ex: '.-/'. Defaults to empty string.
 *
 * @return {nvarchar(4000)} A string for natural sorting.
 * Example of use:
 *
 *      SELECT Name FROM TableA ORDER BY Name
 *
 *  TableA (unordered)          TableA (ordered)
 *  ------------------          ----------------
 *  ID  Name                    ID  Name
 *  1.  A1.                     1.  A1-1.
 *  2.  A1-1.                   2.  A1.
 *  3.  R1              -->     3.  R1
 *  4.  R11                     4.  R11
 *  5.  R2                      5.  R2
 *
 *  As we can see, humans would expect A1., A1-1., R1, R2, R11, but that's not how SQL sorts it.
 *  We can use this function to fix this.
 *
 *      SELECT Name FROM TableA ORDER BY dbo.udf_NaturalSortFormat(Name, default, '.-')
 *
 *  TableA (unordered)          TableA (ordered)
 *  ------------------          ----------------
 *  ID  Name                    ID  Name
 *  1.  A1.                     1.  A1.
 *  2.  A1-1.                   2.  A1-1.
 *  3.  R1              -->     3.  R1
 *  4.  R11                     4.  R2
 *  5.  R2                      5.  R11
 */
CREATE FUNCTION dbo.udf_NaturalSortFormat(
    @string nvarchar(4000),
    @numberLength int = 10,
    @sameOrderChars char(50) = ''
)
RETURNS nvarchar(4000)
AS
BEGIN
    DECLARE @sortString nvarchar(4000),
        @numStartIndex int,
        @numEndIndex int,
        @padLength int,
        @totalPadLength int,
        @i int,
        @sameOrderCharsLen int;

    SELECT 
        @totalPadLength = 0,
        @string = RTRIM(LTRIM(@string)),
        @sortString = @string,
        @numStartIndex = PATINDEX('%[0-9]%', @string),
        @numEndIndex = 0,
        @i = 1,
        @sameOrderCharsLen = LEN(@sameOrderChars);

    -- Replace all char that has to have the same order by a space.
    WHILE (@i <= @sameOrderCharsLen)
    BEGIN
        SET @sortString = REPLACE(@sortString, SUBSTRING(@sameOrderChars, @i, 1), ' ');
        SET @i = @i + 1;
    END

    -- Pad numbers with zeros.
    WHILE (@numStartIndex <> 0)
    BEGIN
        SET @numStartIndex = @numStartIndex + @numEndIndex;
        SET @numEndIndex = @numStartIndex;

        WHILE(PATINDEX('[0-9]', SUBSTRING(@string, @numEndIndex, 1)) = 1)
        BEGIN
            SET @numEndIndex = @numEndIndex + 1;
        END

        SET @numEndIndex = @numEndIndex - 1;

        SET @padLength = @numberLength - (@numEndIndex + 1 - @numStartIndex);

        IF @padLength < 0
        BEGIN
            SET @padLength = 0;
        END

        SET @sortString = STUFF(
            @sortString,
            @numStartIndex + @totalPadLength,
            0,
            REPLICATE('0', @padLength)
        );

        SET @totalPadLength = @totalPadLength + @padLength;
        SET @numStartIndex = PATINDEX('%[0-9]%', RIGHT(@string, LEN(@string) - @numEndIndex));
    END

    RETURN @sortString;
END

GO
plalx
  • @MarkSteudel You would have to give it a go and test it for yourself. At worst you could always cache the formatted values. That's probably what I would do for large tables because you could index the field as well. – plalx Feb 01 '17 at 00:48
15

MySQL doesn't support this sort of "natural sorting", so it looks like the best way to get what you're after is to split your data set up as you've described above (separate id field, etc.) or, failing that, to sort on a non-title, indexed column in your db (date, inserted id, etc.).

Having the db do the sorting for you is almost always going to be quicker than reading large data sets into your programming language of choice and sorting them there. So if you have any control at all over the db schema, look at adding easily sorted fields as described above; it'll save you a lot of hassle and maintenance in the long run.

Requests to add a "natural sort" come up from time to time on the MySQL bugs and discussion forums, and many solutions revolve around stripping out specific parts of your data and casting them for the ORDER BY part of the query, e.g.

SELECT * FROM `table` ORDER BY CAST(MID(name, 6, LENGTH(name) - 5) AS UNSIGNED);

This sort of solution could just about be made to work on your Final Fantasy example above, but it isn't particularly flexible and is unlikely to extend cleanly to a dataset including, say, "Warhammer 40,000" and "James Bond 007", I'm afraid.

ConroyP
10

So, while I know that you have found a satisfactory answer, I was struggling with this problem for a while, and we'd previously determined that it could not be done reasonably well in SQL and that we were going to have to use JavaScript on a JSON array.

Here's how I solved it just using SQL. Hopefully this is helpful for others:

I had data such as:

Scene 1
Scene 1A
Scene 1B
Scene 2A
Scene 3
...
Scene 101
Scene XXA1
Scene XXA2

I actually didn't "cast" things though I suppose that may also have worked.

I first replaced the parts that were unchanging in the data, in this case "Scene ", and then did a LPAD to line things up. This seems to allow pretty well for the alpha strings to sort properly as well as the numbered ones.

My ORDER BY clause looks like:

ORDER BY LPAD(REPLACE(`table`.`column`,'Scene ',''),10,'0')

Obviously this doesn't help with the original problem which was not so uniform - but I imagine this would probably work for many other related problems, so putting it out there.
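
To check how this key behaves on data like the sample above, here is a small Python sketch. The `lpad` helper is an assumption written to mimic MySQL's `LPAD()` for a single pad character (including truncation when the input is longer than the target length), and Python compares strings case-sensitively, which may differ from the column's collation.

```python
def lpad(value, length, pad):
    """Mimic LPAD(value, length, pad) for a one-character pad string."""
    return value[:length] if len(value) >= length else value.rjust(length, pad)


def scene_key(name):
    """Strip the constant prefix, then left-pad the remainder to 10 characters."""
    return lpad(name.replace("Scene ", ""), 10, "0")


scenes = ["Scene 101", "Scene 2A", "Scene 1", "Scene XXA1", "Scene 3", "Scene 1A"]
print(sorted(scenes, key=scene_key))
# ['Scene 1', 'Scene 3', 'Scene 1A', 'Scene 2A', 'Scene 101', 'Scene XXA1']
```

In this sketch the pure numbers come out in numeric order (1, 3, 101), but "Scene 1A" lands after "Scene 3" because the letter suffix makes the remainder one character longer. Whether that grouping is acceptable depends on the data.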

FilmJ
  • The `LPAD()` hint was very helpful. I have words and numbers to sort, with `LPAD` I could sort the numbers naturally. And using `CONCAT` I ignore non-numbers. My query looks like this (alias is the column to sort): `IF(CONCAT("",alias*1)=alias, LPAD(alias,5,"0"), alias) ASC;` – Avatar Feb 05 '20 at 06:44
6
  1. Add a Sort Key (Rank) in your table. ORDER BY rank

  2. Utilise the "Release Date" column. ORDER BY release_date

  3. When extracting the data from SQL, make your object do the sorting. E.g., if extracting into a Set, make it a TreeSet, make your data model implement Comparable, and implement the natural sort algorithm there. Insertion sort will suffice if you are using a language without collections, since you'll be reading the rows from SQL one by one as you create your model and insert it into the collection.
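
The third option can be sketched in Python as follows. This is one possible natural-sort key, not a canonical algorithm: `natural_key` and `insert_sorted` are hypothetical names, the key lower-cases text, and `bisect.insort` plays the role of inserting into a sorted collection as each row arrives.

```python
import bisect
import re


def natural_key(title):
    """Split into text and digit runs; digit runs compare by numeric value."""
    parts = re.split(r"([0-9]+)", title)
    # With one capture group, odd indices are always the digit runs.
    return [(1, int(p), "") if i % 2 else (0, 0, p.lower())
            for i, p in enumerate(parts)]


def insert_sorted(collection, title):
    """Insert (key, title) so that the collection stays naturally sorted."""
    bisect.insort(collection, (natural_key(title), title))


rows = ["Final Fantasy 10", "Final Fantasy Tactics", "Final Fantasy 4",
        "Final Fantasy", "Final Fantasy 12: Chains of Promathia", "Final Fantasy 12"]
ordered = []
for row in rows:  # stands in for reading rows from the database one by one
    insert_sorted(ordered, row)
print([title for _, title in ordered])
# ['Final Fantasy', 'Final Fantasy 4', 'Final Fantasy 10', 'Final Fantasy 12',
#  'Final Fantasy 12: Chains of Promathia', 'Final Fantasy Tactics']
```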

JeeBee
6

Regarding the best response from Richard Toth https://stackoverflow.com/a/12257917/4052357

Watch out for UTF-8 encoded strings that contain both multi-byte characters and numbers, e.g.

12 南新宿

Using MySQL's LENGTH() in the udf_NaturalSortFormat function returns the byte length of the string, which is incorrect here. Use CHAR_LENGTH() instead, which returns the character length.

In my case, using LENGTH() caused queries to never complete and drove MySQL to 100% CPU usage.
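
The difference between the two lengths is easy to reproduce outside MySQL. In this small Python sketch, `char_length` and `byte_length` are illustrative stand-ins for `CHAR_LENGTH()` and for `LENGTH()` on a UTF-8 string.

```python
def char_length(value):
    """Number of characters, like CHAR_LENGTH()."""
    return len(value)


def byte_length(value, encoding="utf-8"):
    """Number of bytes in the given encoding, like LENGTH() on a UTF-8 string."""
    return len(value.encode(encoding))


title = "12 南新宿"
print(char_length(title), byte_length(title))
# 6 12
```

Functions such as `SUBSTRING()` and `RIGHT()` work in characters, so position arithmetic based on the byte count no longer lines up with the string once multi-byte characters are present.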

DROP FUNCTION IF EXISTS `udf_NaturalSortFormat`;
DELIMITER ;;
CREATE FUNCTION `udf_NaturalSortFormat` (`instring` varchar(4000), `numberLength` int, `sameOrderChars` char(50)) 
RETURNS varchar(4000)
LANGUAGE SQL
DETERMINISTIC
NO SQL
SQL SECURITY INVOKER
BEGIN
    DECLARE sortString varchar(4000);
    DECLARE numStartIndex int;
    DECLARE numEndIndex int;
    DECLARE padLength int;
    DECLARE totalPadLength int;
    DECLARE i int;
    DECLARE sameOrderCharsLen int;

    SET totalPadLength = 0;
    SET instring = TRIM(instring);
    SET sortString = instring;
    SET numStartIndex = udf_FirstNumberPos(instring);
    SET numEndIndex = 0;
    SET i = 1;
    SET sameOrderCharsLen = CHAR_LENGTH(sameOrderChars);

    WHILE (i <= sameOrderCharsLen) DO
        SET sortString = REPLACE(sortString, SUBSTRING(sameOrderChars, i, 1), ' ');
        SET i = i + 1;
    END WHILE;

    WHILE (numStartIndex <> 0) DO
        SET numStartIndex = numStartIndex + numEndIndex;
        SET numEndIndex = numStartIndex;

        WHILE (udf_FirstNumberPos(SUBSTRING(instring, numEndIndex, 1)) = 1) DO
            SET numEndIndex = numEndIndex + 1;
        END WHILE;

        SET numEndIndex = numEndIndex - 1;

        SET padLength = numberLength - (numEndIndex + 1 - numStartIndex);

        IF padLength < 0 THEN
            SET padLength = 0;
        END IF;

        SET sortString = INSERT(sortString, numStartIndex + totalPadLength, 0, REPEAT('0', padLength));

        SET totalPadLength = totalPadLength + padLength;
        SET numStartIndex = udf_FirstNumberPos(RIGHT(instring, CHAR_LENGTH(instring) - numEndIndex));
    END WHILE;

    RETURN sortString;
END
;;
DELIMITER ;

p.s. I would have added this as a comment to the original but I don't have enough reputation (yet)

Community
Luke Hoggett
5

To order:
0
1
2
10
23
101
205
1000
a
aac
b
casdsadsa
css

Use this query:

SELECT 
    column_name 
FROM 
    table_name 
ORDER BY
    column_name REGEXP '^\d*[^\da-z&\.\' \-\"\!\@\#\$\%\^\*\(\)\;\:\\,\?\/\~\`\|\_\-]' DESC, 
    column_name + 0, 
    column_name;
Guma
4

Another option is to do the sorting in memory after pulling the data from MySQL. While it won't be the best option from a performance standpoint, if you are not sorting huge lists you should be fine.

If you take a look at Jeff's post, "Sorting for Humans: Natural Sort Order", you can find plenty of algorithms for whatever language you might be working with.

Cœur
Bob
4

Add a field for "sort key" that has all strings of digits zero-padded to a fixed length and then sort on that field instead.

If you might have long strings of digits, another method is to prepend the number of digits (fixed-width, zero-padded) to each string of digits. For example, if you won't have more than 99 digits in a row, then for "Super Blast 10 Ultra" the sort key would be "Super Blast 0210 Ultra".
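
A minimal Python sketch of the length-prefix variant, assuming a two-digit prefix as in the example above. The function name `length_prefixed_key` and the error handling are illustrative choices, not part of the original suggestion.

```python
import re


def length_prefixed_key(value, prefix_width=2):
    """Prefix every run of digits with its own length, zero-padded to prefix_width."""
    def add_prefix(match):
        digits = match.group(0)
        if len(str(len(digits))) > prefix_width:
            raise ValueError("digit run too long for the chosen prefix width")
        return str(len(digits)).zfill(prefix_width) + digits

    return re.sub(r"[0-9]+", add_prefix, value)


print(length_prefixed_key("Super Blast 10 Ultra"))
# Super Blast 0210 Ultra

titles = ["Super Blast 10 Ultra", "Super Blast 9 Ultra", "Super Blast 100 Ultra"]
print(sorted(titles, key=length_prefixed_key))
# ['Super Blast 9 Ultra', 'Super Blast 10 Ultra', 'Super Blast 100 Ultra']
```

Because the prefix sorts shorter numbers first, numbers of equal length then compare correctly digit by digit, without padding every number to the maximum width.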

tye
4

If you do not want to reinvent the wheel or have a headache with a lot of code that does not work, just use Drupal Natural Sort ... Just run the SQL that comes zipped (MySQL or PostgreSQL), and that's it. When making a query, simply order using:

... ORDER BY natsort_canon(column_name, 'natural')
  • Thanks for this, I've been trying all sorts of solutions (ha ha see what I did there?) but none of them really worked for all the data I had. The drupal function worked like a charm. Thanks for posting. – Ben Hitchcock Mar 13 '18 at 08:15
  • this works but sorts numbers at the end (A-Z then 0-9) – realgt Jul 22 '19 at 15:42
3

You can also create the "sort column" dynamically:

SELECT name, (name = '-') boolDash, (name = '0') boolZero, (name+0 > 0) boolNum
FROM `table`
ORDER BY boolDash DESC, boolZero DESC, boolNum DESC, (name+0), name;

That way, you can create groups to sort.

In my query, I wanted the '-' in front of everything, then the numbers, then the text, which could result in something like:

-
0    
1
2
3
4
5
10
13
19
99
102
Chair
Dog
Table
Windows

That way you don't have to maintain the sort column in the correct order as you add data. You can also change your sort order depending on what you need.
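
The grouping logic can be sketched in Python as a tuple key, one element per ORDER BY term. This is an approximation: `leading_number` is a hypothetical stand-in for MySQL's `name+0` that only handles unsigned integers, and each flag is inverted because Python sorts `False` before `True` where the query uses DESC.

```python
import re


def leading_number(value):
    """Approximate `value + 0` in MySQL: leading digits as an int, else 0."""
    match = re.match(r"[0-9]+", value)
    return int(match.group(0)) if match else 0


def group_key(name):
    """Dash first, then '0', then positive numbers by value, then text."""
    number = leading_number(name)
    return (name != "-", name != "0", number <= 0, number, name)


rows = ["Dog", "10", "-", "2", "0", "Chair", "102"]
print(sorted(rows, key=group_key))
# ['-', '0', '2', '10', '102', 'Chair', 'Dog']
```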

antoine
  • I don't know how performant this would be. I am using it all the time without any inconveniences. My database isn't big tho. – antoine Oct 17 '13 at 12:39
3

A lot of other answers I see here (and in the duplicate questions) basically only work for very specifically formatted data, e.g. a string that's entirely a number, or for which there's a fixed-length alphabetic prefix. This isn't going to work in the general case.

It's true that there's not really any way to implement a 100% general nat-sort in MySQL, because what you really need is a modified comparison function that switches between lexicographic sorting of the strings and numeric sorting if/when it encounters a number. Such code could implement any algorithm you could desire for recognising and comparing the numeric portions within two strings. Unfortunately, though, the comparison function in MySQL is internal to its code, and cannot be changed by the user.
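
For illustration, this is roughly what such a comparison function looks like in a language that lets you supply one. It is a minimal Python sketch that handles plain unsigned integers only; `natural_compare` is a hypothetical name, and numbers that differ only in leading zeros (such as "007" and "7") compare as equal here.

```python
import re
from functools import cmp_to_key

DIGITS = "0123456789"


def natural_compare(a, b):
    """Compare chunk by chunk: digit runs by value, everything else as text."""
    chunks_a = re.findall(r"[0-9]+|[^0-9]+", a)
    chunks_b = re.findall(r"[0-9]+|[^0-9]+", b)
    for x, y in zip(chunks_a, chunks_b):
        if x[0] in DIGITS and y[0] in DIGITS:
            # Both chunks are numbers: compare by value, so 4 < 10.
            nx, ny = int(x), int(y)
            if nx != ny:
                return -1 if nx < ny else 1
        elif x != y:
            # At least one chunk is text: plain lexicographic comparison.
            return -1 if x < y else 1
    # All shared chunks are equal: the string with fewer chunks sorts first.
    return len(chunks_a) - len(chunks_b)


titles = ["James Bond 007", "Final Fantasy 10", "Final Fantasy", "Final Fantasy 4"]
print(sorted(titles, key=cmp_to_key(natural_compare)))
# ['Final Fantasy', 'Final Fantasy 4', 'Final Fantasy 10', 'James Bond 007']
```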

This leaves a hack of some kind, where you try to create a sort key for your string in which the numeric parts are re-formatted so that the standard lexicographic sort actually sorts them the way you want.

For plain integers up to some maximum number of digits, the obvious solution is to simply left-pad them with zeros so that they're all fixed width. This is the approach taken by the Drupal plugin, and the solutions of @plalx / @RichardToth. (@Christian has a different and much more complex solution, but it offers no advantages that I can see).

As @tye points out, you can improve on this by prepending a fixed-digit length to each number, rather than simply left-padding it. There's much, much more you can improve on, though, even given the limitations of what is essentially an awkward hack. Yet there don't seem to be any pre-built solutions out there!

For example, what about:

  • Plus and minus signs? +10 vs 10 vs -10
  • Decimals? 8.2, 8.5, 1.006, .75
  • Leading zeros? 020, 030, 00000922
  • Thousand separators? "1,001 Dalmatians" vs "1001 Dalmatians"
  • Version numbers? MariaDB v10.3.18 vs MariaDB v10.3.3
  • Very long numbers? 103,768,276,592,092,364,859,236,487,687,870,234,598.55

Extending on @tye's method, I've created a fairly compact NatSortKey() stored function that will convert an arbitrary string into a nat-sort key, and that handles all of the above cases, is reasonably efficient, and preserves a total sort-order (no two different strings have sort keys that compare equal). A second parameter can be used to limit the number of numbers processed in each string (e.g. to the first 10 numbers, say), which can be used to ensure the output fits within a given length.

NOTE: Sort-key strings generated with a given value of this 2nd parameter should only be sorted against other strings generated with the same value for the parameter, or else they might not sort correctly!

You can use it directly in ordering, e.g.

SELECT myString FROM myTable ORDER BY NatSortKey(myString,0);  ### 0 means process all numbers - resulting sort key might be quite long for certain inputs

But for efficient sorting of large tables, it's better to pre-store the sort key in another column (possibly with an index on it):

INSERT INTO myTable (myString,myStringNSK) VALUES (@theStringValue,NatSortKey(@theStringValue,10)), ...
...
SELECT myString FROM myTable ORDER BY myStringNSK;

[Ideally, you'd make this happen automatically by creating the key column as a computed stored column, using something like:

CREATE TABLE myTable (
...
myString varchar(100),
myStringNSK varchar(150) AS (NatSortKey(myString,10)) STORED,
...
KEY (myStringNSK),
...);

But for now neither MySQL nor MariaDB allows stored functions in computed columns, so unfortunately you can't yet do this.]


My function affects sorting of numbers only. If you want to do other sort-normalization things, such as removing all punctuation, or trimming whitespace off each end, or replacing multi-whitespace sequences with single spaces, you could either extend the function, or it could be done before or after NatSortKey() is applied to your data. (I'd recommend using REGEXP_REPLACE() for this purpose).

It's also somewhat Anglo-centric in that I assume '.' for a decimal point and ',' for the thousands-separator, but it should be easy enough to modify if you want the reverse, or if you want that to be switchable as a parameter.

It might be amenable to further improvement in other ways; for example it currently sorts negative numbers by absolute value, so -1 comes before -2, rather than the other way around. There's also no way to specify a DESC sort order for numbers while retaining ASC lexicographical sort for text. Both of these issues can be fixed with a little more work; I will update the code if/when I get the time.

There are lots of other details to be aware of - including some critical dependencies on the charset and collation that you're using - but I've put them all into a comment block within the SQL code. Please read this carefully before using the function for yourself!

So, here's the code. If you find a bug, or have an improvement I haven't mentioned, please let me know in the comments!


delimiter $$
CREATE DEFINER=CURRENT_USER FUNCTION NatSortKey (s varchar(100), n int) RETURNS varchar(350) DETERMINISTIC
BEGIN
/****
  Converts numbers in the input string s into a format such that sorting results in a nat-sort.
  Numbers of up to 359 digits (before the decimal point, if one is present) are supported.  Sort results are undefined if the input string contains numbers longer than this.
  For n>0, only the first n numbers in the input string will be converted for nat-sort (so strings that differ only after the first n numbers will not nat-sort amongst themselves).
  Total sort-ordering is preserved, i.e. if s1!=s2, then NatSortKey(s1,n)!=NatSortKey(s2,n), for any given n.
  Numbers may contain ',' as a thousands separator, and '.' as a decimal point.  To reverse these (as appropriate for some European locales), the code would require modification.
  Numbers preceded by '+' sort with numbers not preceded with either a '+' or '-' sign.
  Negative numbers (preceded with '-') sort before positive numbers, but are sorted in order of ascending absolute value (so -7 sorts BEFORE -1001).
  Numbers with leading zeros sort after the same number with no (or fewer) leading zeros.
  Decimal-part-only numbers (like .75) are recognised, provided the decimal point is not immediately preceded by either another '.', or by a letter-type character.
  Numbers with thousand separators sort after the same number without them.
  Thousand separators are only recognised in numbers with no leading zeros that don't immediately follow a ',', and when they format the number correctly.
  (When not recognised as a thousand separator, a ',' will instead be treated as separating two distinct numbers).
  Version-number-like sequences consisting of 3 or more numbers separated by '.' are treated as distinct entities, and each component number will be nat-sorted.
  The entire entity will sort after any number beginning with the first component (so e.g. 10.2.1 sorts after both 10 and 10.995, but before 11)
  Note that the first number component in an entity like this is also permitted to contain thousand separators.

  To achieve this, numbers within the input string are prefixed and suffixed according to the following format:
  - The number is prefixed by a 2-digit base-36 number representing its length, excluding leading zeros.  If there is a decimal point, this length only includes the integer part of the number.
  - A 3-character suffix is appended after the number (after the decimals if present).
    - The first character is a space, or a '+' sign if the number was preceded by '+'.  Any preceding '+' sign is also removed from the front of the number.
    - This is followed by a 2-digit base-36 number that encodes the number of leading zeros and whether the number was expressed in comma-separated form (e.g. 1,000,000.25 vs 1000000.25)
    - The value of this 2-digit number is: (number of leading zeros)*2 + (1 if comma-separated, 0 otherwise)
  - For version number sequences, each component number has the prefix in front of it, and the separating dots are removed.
    Then there is a single suffix that consists of a ' ' or '+' character, followed by a pair of base-36 digits for each number component in the sequence.

  e.g. here is how some simple sample strings get converted:
  'Foo055' --> 'Foo0255 02'
  'Absolute zero is around -273 centigrade' --> 'Absolute zero is around -03273 00 centigrade'
  'The $1,000,000 prize' --> 'The $071000000 01 prize'
  '+99.74 degrees' --> '0299.74+00 degrees'
  'I have 0 apples' --> 'I have 00 02 apples'
  '.5 is the same value as 0000.5000' --> '00.5 00 is the same value as 00.5000 08'
  'MariaDB v10.3.0018' --> 'MariaDB v02100130218 000004'

  The restriction to numbers of up to 359 digits comes from the fact that the first character of the base-36 prefix MUST be a decimal digit, and so the highest permitted prefix value is '9Z' or 359 decimal.
  The code could be modified to handle longer numbers by increasing the size of (both) the prefix and suffix.
  A higher base could also be used (by replacing CONV() with a custom function), provided that the collation you are using sorts the "digits" of the base in the correct order, starting with 0123456789.
  However, while the maximum number length may be increased this way, note that the technique this function uses is NOT applicable where strings may contain numbers of unlimited length.

  The function definition does not specify the charset or collation to be used for string-type parameters or variables:  The default database charset & collation at the time the function is defined will be used.
  This is to make the function code more portable.  However, there are some important restrictions:

  - Collation is important here only when comparing (or storing) the output value from this function, but it MUST order the characters " +0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ" in that order for the natural sort to work.
    This is true for most collations, but not all of them, e.g. in Lithuanian 'Y' comes before 'J' (according to Wikipedia).
    To adapt the function to work with such collations, replace CONV() in the function code with a custom function that emits "digits" above 9 that are characters ordered according to the collation in use.

  - For efficiency, the function code uses LENGTH() rather than CHAR_LENGTH() to measure the length of strings that consist only of digits 0-9, '.', and ',' characters.
    This works for any single-byte charset, as well as any charset that maps standard ASCII characters to single bytes (such as utf8 or utf8mb4).
    If using a charset that maps these characters to multiple bytes (such as utf16 or utf32), you MUST replace all instances of LENGTH() in the function definition with CHAR_LENGTH().

  Length of the output:

  Each number converted adds 5 characters (2 prefix + 3 suffix) to the length of the string. n is the maximum count of numbers to convert;
  this parameter is provided as a means to limit the maximum output length (to input length + 5*n).
  If you do not require the total-ordering property, you could edit the code to use suffixes of 1 character (space or plus) only; this would reduce the maximum output length for any given n.
  Since a string of length L has at most ((L+1) DIV 2) individual numbers in it (every 2nd character a digit), for n<=0 the maximum output length is (inputlength + 5*((inputlength+1) DIV 2))
  So for the current input length of 100, the maximum output length is 350.
  If changing the input length, the output length must be modified according to the above formula.  The DECLARE statements for x,y,r, and suf must also be modified, as the code comments indicate.
****/
  DECLARE x,y varchar(100);            # need to be same length as input s
  DECLARE r varchar(350) DEFAULT '';   # return value:  needs to be same length as return type
  DECLARE suf varchar(101);   # suffix for a number or version string. Must be (((inputlength+1) DIV 2)*2 + 1) chars to support version strings (e.g. '1.2.33.5'), though it's usually just 3 chars. (Max version string e.g. 1.2. ... .5 has ((length of input + 1) DIV 2) numeric components)
  DECLARE i,j,k int UNSIGNED;
  IF n<=0 THEN SET n := -1; END IF;   # n<=0 means "process all numbers"
  LOOP
    SET i := REGEXP_INSTR(s,'\\d');   # find position of next digit
    IF i=0 OR n=0 THEN RETURN CONCAT(r,s); END IF;   # no more numbers to process -> we're done
    SET n := n-1, suf := ' ';
    IF i>1 THEN
      IF SUBSTRING(s,i-1,1)='.' AND (i=2 OR SUBSTRING(s,i-2,1) RLIKE '[^.\\p{L}\\p{N}\\p{M}\\x{608}\\x{200C}\\x{200D}\\x{2100}-\\x{214F}\\x{24B6}-\\x{24E9}\\x{1F130}-\\x{1F149}\\x{1F150}-\\x{1F169}\\x{1F170}-\\x{1F189}]') AND (SUBSTRING(s,i) NOT RLIKE '^\\d++\\.\\d') THEN SET i:=i-1; END IF;   # Allow decimal number (but not version string) to begin with a '.', provided preceding char is neither another '.', nor a member of the unicode character classes: "Alphabetic", "Letter", "Block=Letterlike Symbols" "Number", "Mark", "Join_Control"
      IF i>1 AND SUBSTRING(s,i-1,1)='+' THEN SET suf := '+', j := i-1; ELSE SET j := i; END IF;   # move any preceding '+' into the suffix, so equal numbers with and without preceding "+" signs sort together
      SET r := CONCAT(r,SUBSTRING(s,1,j-1)); SET s = SUBSTRING(s,i);   # add everything before the number to r and strip it from the start of s; preceding '+' is dropped (not included in either r or s)
    END IF;
    SET x := REGEXP_SUBSTR(s,IF(SUBSTRING(s,1,1) IN ('0','.') OR (SUBSTRING(r,-1)=',' AND suf=' '),'^\\d*+(?:\\.\\d++)*','^(?:[1-9]\\d{0,2}(?:,\\d{3}(?!\\d))++|\\d++)(?:\\.\\d++)*+'));   # capture the number + following decimals (including multiple consecutive '.<digits>' sequences)
    SET s := SUBSTRING(s,LENGTH(x)+1);   # NOTE: LENGTH() can be safely used instead of CHAR_LENGTH() here & below PROVIDED we're using a charset that represents digits, ',' and '.' characters using single bytes (e.g. latin1, utf8)
    SET i := INSTR(x,'.');
    IF i=0 THEN SET y := ''; ELSE SET y := SUBSTRING(x,i); SET x := SUBSTRING(x,1,i-1); END IF;   # move any following decimals into y
    SET i := LENGTH(x);
    SET x := REPLACE(x,',','');
    SET j := LENGTH(x);
    SET x := TRIM(LEADING '0' FROM x);   # strip leading zeros
    SET k := LENGTH(x);
    SET suf := CONCAT(suf,LPAD(CONV(LEAST((j-k)*2,1294) + IF(i=j,0,1),10,36),2,'0'));   # (j-k)*2 + IF(i=j,0,1) = (count of leading zeros)*2 + (1 if there are thousands-separators, 0 otherwise)  Note the first term is bounded to <= base-36 'ZY' as it must fit within 2 characters
    SET i := LOCATE('.',y,2);
    IF i=0 THEN
      SET r := CONCAT(r,LPAD(CONV(LEAST(k,359),10,36),2,'0'),x,y,suf);   # k = count of digits in number, bounded to be <= '9Z' base-36
    ELSE   # encode a version number (like 3.12.707, etc)
      SET r := CONCAT(r,LPAD(CONV(LEAST(k,359),10,36),2,'0'),x);   # k = count of digits in number, bounded to be <= '9Z' base-36
      WHILE LENGTH(y)>0 AND n!=0 DO
        IF i=0 THEN SET x := SUBSTRING(y,2); SET y := ''; ELSE SET x := SUBSTRING(y,2,i-2); SET y := SUBSTRING(y,i); SET i := LOCATE('.',y,2); END IF;
        SET j := LENGTH(x);
        SET x := TRIM(LEADING '0' FROM x);   # strip leading zeros
        SET k := LENGTH(x);
        SET r := CONCAT(r,LPAD(CONV(LEAST(k,359),10,36),2,'0'),x);   # k = count of digits in number, bounded to be <= '9Z' base-36
        SET suf := CONCAT(suf,LPAD(CONV(LEAST((j-k)*2,1294),10,36),2,'0'));   # (j-k)*2 = (count of leading zeros)*2, bounded to fit within 2 base-36 digits
        SET n := n-1;
      END WHILE;
      SET r := CONCAT(r,y,suf);
    END IF;
  END LOOP;
END
$$
delimiter ;
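To make the encoding described in the function's comments easier to follow, here is a minimal Python sketch of just the core step: each run of digits gets a 2-character base-36 length prefix and a suffix recording its leading zeros. It is a simplified illustration, not a port of NatSortKey: it ignores decimals, version strings, '+' signs and thousands separators, and the names `to_base36` and `simple_nat_key` are made up for this example.

```python
import re

BASE36_DIGITS = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"


def to_base36(value: int, width: int = 2) -> str:
    """Encode a non-negative integer in base 36, left-padded with '0'."""
    out = ""
    while value:
        value, rem = divmod(value, 36)
        out = BASE36_DIGITS[rem] + out
    return out.rjust(width, "0")


def simple_nat_key(text: str) -> str:
    """Hypothetical, simplified sort key: plain integers only."""

    def encode(match: re.Match) -> str:
        raw = match.group(0)
        stripped = raw.lstrip("0")              # strip leading zeros
        zeros = len(raw) - len(stripped)
        # Prefix: digit count, capped at 359 ('9Z') as in the SQL function.
        prefix = to_base36(min(len(stripped), 359))
        # Suffix: space + (leading zeros * 2), capped at 1294 ('ZY').
        suffix = " " + to_base36(min(zeros * 2, 1294))
        return prefix + stripped + suffix

    return re.sub(r"[0-9]+", encode, text)


print(simple_nat_key("Foo055"))            # Foo0255 02
print(simple_nat_key("I have 0 apples"))   # I have 00 02 apples
```

Sorting by this key puts 'Final Fantasy 4' before 'Final Fantasy 10' because the encoded '014' compares lower than '0210' as plain text.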
Doin
  • I'm a beginner in MySQL and tried this. Got this error: "#1305 - FUNCTION mydatabase.REGEXP_INSTR does not exist". Any idea? – John T Nov 07 '19 at 23:09
  • For any other newbie out there: I didn't have MySQL 8.0 installed. It's needed for REGEXP_INSTR (and the other REGEXP functions). – John T Nov 08 '19 at 00:40
  • Just fixed a serious bug in NatSortKey: there was an incorrect regex character. If you've used this function yourself, please update your code! – Doin May 19 '20 at 06:27
2

Other answers are correct, but you may want to know that MariaDB 10.7 and later (including 10.11 LTS) provide a natural_sort_key() function. It is documented in the MariaDB Knowledge Base.

Nick ODell
1

A simplified non-UDF version of the best response of @plaix/Richard Toth/Luke Hoggett, which works only for the first integer in the field, is:

SELECT name,
LEAST(
    IFNULL(NULLIF(LOCATE('0', name), 0), ~0),
    IFNULL(NULLIF(LOCATE('1', name), 0), ~0),
    IFNULL(NULLIF(LOCATE('2', name), 0), ~0),
    IFNULL(NULLIF(LOCATE('3', name), 0), ~0),
    IFNULL(NULLIF(LOCATE('4', name), 0), ~0),
    IFNULL(NULLIF(LOCATE('5', name), 0), ~0),
    IFNULL(NULLIF(LOCATE('6', name), 0), ~0),
    IFNULL(NULLIF(LOCATE('7', name), 0), ~0),
    IFNULL(NULLIF(LOCATE('8', name), 0), ~0),
    IFNULL(NULLIF(LOCATE('9', name), 0), ~0)
) AS first_int
FROM `table`
ORDER BY IF(first_int = ~0, name, CONCAT(
    SUBSTR(name, 1, first_int - 1),
    LPAD(CAST(SUBSTR(name, first_int) AS UNSIGNED), LENGTH(~0), '0'),
    SUBSTR(name, first_int + LENGTH(CAST(SUBSTR(name, first_int) AS UNSIGNED)))
)) ASC
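The idea behind the query above (zero-pad the first integer so that plain text comparison orders it numerically) can be sketched in Python as follows. This is an illustration under assumptions, not a translation of the SQL: the name `pad_first_int` is hypothetical, and unlike the query it skips the whole digit run, including any leading zeros, when copying the rest of the string.

```python
import re

# Same width as LENGTH(~0) in the query: 18446744073709551615 has 20 digits.
PAD_WIDTH = 20


def pad_first_int(name: str) -> str:
    """Return a sort key with the first integer in `name` zero-padded."""
    match = re.search(r"[0-9]+", name)
    if match is None:
        return name  # no digits: sort by the name itself
    number = str(int(match.group(0)))  # drops leading zeros, like the CAST
    return name[:match.start()] + number.zfill(PAD_WIDTH) + name[match.end():]


titles = ["Final Fantasy 10", "Final Fantasy 4", "Final Fantasy"]
print(sorted(titles, key=pad_first_int))
```

Only the first number is padded, so titles that differ in a second number (for example "Part 2" versus "Part 10" after an identical first number) still sort as plain text.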
bonger
1

I have tried several solutions, but it is actually very simple:

SELECT test_column FROM test_table ORDER BY LENGTH(test_column) DESC, test_column DESC

/* 
Result 
--------
value_1
value_2
value_3
value_4
value_5
value_6
value_7
value_8
value_9
value_10
value_11
value_12
value_13
value_14
value_15
...
*/
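A small Python sketch of length-then-value ordering shows when it gives a natural order and when it does not. It uses the ascending form (the query above uses DESC, which simply reverses the result); the function name `length_then_value` and the sample values are assumptions made for this illustration.

```python
def length_then_value(values):
    """Sort by string length first, then by plain string comparison."""
    return sorted(values, key=lambda v: (len(v), v))


same_prefix = ["value_10", "value_2", "value_1"]
mixed_prefix = same_prefix + ["z_99"]

# With one shared prefix, shorter strings mean smaller numbers.
print(length_then_value(same_prefix))   # ['value_1', 'value_2', 'value_10']

# With a different prefix, length no longer tracks the number:
# 'z_99' is the shortest string, so it sorts ahead of every 'value_*'.
print(length_then_value(mixed_prefix))  # ['z_99', 'value_1', 'value_2', 'value_10']
```

The approach is therefore only safe when every value shares the same non-numeric prefix.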
Tarik
  • 1
    Works very well for sorting numbers in the format `23-4244`. Thanks :) – Pyton Sep 01 '16 at 09:48
  • 1
    only works with this test data because the strings before the number are all the same. Try sticking in a value `z_99` in there and it will get put at the top but `z` comes after `v`. – Samuel Neff Aug 19 '17 at 17:12
  • @SamuelNeff please see SQL: ***ORDER BY LENGTH(test_column) DESC, test_column DESC*** so yes, because it will sort by length of the column first. This works well sorting a prefix group of table which otherwise you would not be able to sort with only "test_column DESC" – Tarik Aug 20 '17 at 13:48
1

If you're using PHP, you can do the natural sort in PHP.

$keys = array();
$values = array();
foreach ($results as $index => $row) {
   $key = $row['name'].'__'.$index; // Add the index to create a unique key.
   $keys[] = $key;
   $values[$key] = $row; 
}
natsort($keys);
$sortedValues = array(); 
foreach($keys as $index) {
  $sortedValues[] = $values[$index]; 
}
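For applications not written in PHP, the same application-side approach can be sketched in Python. This is a rough equivalent rather than a reimplementation of PHP's natsort (edge cases such as leading zeros or signs may order differently); `natural_key` and the sample rows are hypothetical.

```python
import re


def natural_key(text):
    """Split text into alternating text and integer parts for comparison."""
    parts = re.split(r"([0-9]+)", text)
    # With a capturing group, re.split alternates: text, digits, text, ...
    # so odd positions are always digit runs and types line up when compared.
    return [int(part) if index % 2 else part for index, part in enumerate(parts)]


# Hypothetical rows, shaped like the $results array in the PHP snippet.
rows = [
    {"name": "Final Fantasy 10"},
    {"name": "Final Fantasy 4"},
    {"name": "Final Fantasy 12"},
]
rows_sorted = sorted(rows, key=lambda row: natural_key(row["name"]))
print([row["name"] for row in rows_sorted])
```

As with the PHP version, every row has to be fetched into the application before it can be sorted.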

I hope MySQL will implement natural sorting in a future version, but the feature request (#1588) has been open since 2003, so I wouldn't hold my breath.

Bob Fanger
  • Theoretically that's possible, but I would need to read all database records to my webserver first. – BlaM Mar 09 '11 at 17:33
  • Alternatively consider: `usort($mydata, function ($item1, $item2) { return strnatcmp($item1['key'], $item2['key']); });` (I have an associative array and sort by key.) Ref: https://stackoverflow.com/q/12426825/1066234 – Avatar Nov 14 '19 at 11:09
0

There is also natsort. It is intended to be part of a Drupal plugin, but it works fine stand-alone.

Peter V. Mørch
0

Here is a simple one if titles only have the version as a number:

ORDER BY CAST(REGEXP_REPLACE(title, "[a-zA-Z]+", "") AS UNSIGNED)

Otherwise you can use simple SQL if you use a pattern (this pattern uses a # before the version):

create table titles (title varchar(100));

insert into titles (title) values 
('Final Fantasy'),
('Final Fantasy #03'),
('Final Fantasy #11'),
('Final Fantasy #10'),
('Final Fantasy #2'),
('Bond 007 ##2'),
('Final Fantasy #01'),
('Bond 007'),
('Final Fantasy #11}');

select REGEXP_REPLACE(title, "#([0-9]+)", "\\1") as title from titles
ORDER BY REGEXP_REPLACE(title, "#[0-9]+", ""),
CAST(REGEXP_REPLACE(title, ".*#([0-9]+).*", "\\1") AS UNSIGNED);
+-------------------+
| title             |
+-------------------+
| Bond 007          |
| Bond 007 #2       |
| Final Fantasy     |
| Final Fantasy 01  |
| Final Fantasy 2   |
| Final Fantasy 03  |
| Final Fantasy 10  |
| Final Fantasy 11  |
| Final Fantasy 11} |
+-------------------+
8 rows in set, 2 warnings (0.001 sec)

You can use other patterns if needed. For example, if you have movies titled "I'm #1" and "I'm #1 part 2", you could wrap the version in braces instead, e.g. "Final Fantasy {11}".

Frank Forte
-4

I know this topic is ancient but I think I've found a way to do this:

SELECT * FROM `table` ORDER BY 
CONCAT(
  GREATEST(
    LOCATE('1', name),
    LOCATE('2', name),
    LOCATE('3', name),
    LOCATE('4', name),
    LOCATE('5', name),
    LOCATE('6', name),
    LOCATE('7', name),
    LOCATE('8', name),
    LOCATE('9', name)
   ),
   name
) ASC

Scrap that, it sorted the following set incorrectly (It's useless lol):

Final Fantasy 1
Final Fantasy 2
Final Fantasy 5
Final Fantasy 7
Final Fantasy 7: Advent Children
Final Fantasy 12
Final Fantasy 112
FF1
FF2

user1467716