1

I have a varchar column in a MySQL database that stores some text which can contain any character and number.

The schema uses utf8 as the character set and utf8_unicode_ci as the collation. InnoDB is used as the database engine as I need to use transactions.

What I would like to do is to be able to ORDER BY that column in a natural way. For example, here's some sample data:

12234 some random text
my text 23
mytext3
123456abcd
text23
text1
text111
text33

After inserting the test data into a test table, I run a query and ORDER BY the varchar column ASC:

SELECT * FROM `test` WHERE 1 ORDER BY data ASC;

The problem is that the "size" of the numbers aren't taken into account:

12234 some random text
123456abcd
my text 23
mytext3
text1
text111 <-------
text23
text33

I then found this question here which provided an (almost) solution:

SELECT * FROM `test` WHERE 1 ORDER BY data * 1 DESC;

123456abcd             <------ These 2 should
12234 some random text <------ be swapped
my text 23
mytext3
text23
text1
text33
text111

Another downside of the above is that the above could not use any indexes because there is an opreation on each row.

I am also not sure what the effects would be like once we introduce non-latin characters.

Is there a performant way to get MySQL to perform natural language sorting?

Community
  • 1
  • 1
F21
  • 32,163
  • 26
  • 99
  • 170
  • I am a little bit confused. If it is sort by DESC, wont 123456 shd come before 12234? – sel Jul 24 '12 at 01:15
  • That's because I am doing `ORDER BY data * 1` as per the question I linked to. – F21 Jul 24 '12 at 01:18
  • possible duplicate of [Natural Sort in MySQL](http://stackoverflow.com/questions/153633/natural-sort-in-mysql) – outis Jul 24 '12 at 01:41

1 Answers1

0

You asked:

   Is there a performant way to get MySQL to perform natural language sorting?

Short answer: no.

Longer answer:

Presumably you want MySQL to collate your date in a way that handles non-numeric text using the chosen collation and numeric text as if it were binary numbers.

You're going to need to create a surrogate key of some kind for collating, and store it in its own column. You could do this by transforming each text string like this, and load both the original string and the surrogate key string into the dbms.

Original Text                   Surrogate Key Text
12234 some random text          0000012234 some random text
my text 23                      my text 0000000023
mytext3                         mytext0000000003
123456abcd                      0000123456abcd
text23                          text0000000023
text1                           text0000000001
text111                         text0000000111
text33                          text0000000033
1text123                        0000000001text0000000123
2text124                        0000000002text0000000124

Notice that each chunk of numeric data in your samples is, in this example, stuffed into a ten-digit decimal number.

This is easiest to do in your application when you load the DBMS. You can write a simple string-parsing routine that will do it. You didn't tell us what application language you're using.

If you're only collating a limited number of records in each query you make, it may also be possible to do it in your application after you retrieve the records.

O. Jones
  • 103,626
  • 17
  • 118
  • 172