35

Following up on this question by Sivaram Chintalapudi, I'm interested in whether it's practical in PostgreSQL to do natural - or "humanized" - sorting " of strings that contain a mixture of multi-digit numbers and words/letters. There is no fixed pattern of words and numbers in the strings, and there may be more than one multi-digit number in a string.

The only place I've seen this done routinely is in the Mac OS's Finder, which sorts filenames containing mixed numbers and words naturally, placing "20" after "3", not before it.

The collation order desired would be produced by an algorithm that split each string into blocks at letter-number boundaries, then ordered each part, treating letter-blocks with normal collation and number-blocks as integers for collation purposes. So:

'AAA2fred' would become ('AAA',2,'fred') and 'AAA10bob' would become ('AAA',10,'bob'). These can then be sorted as desired:

regress=# WITH dat AS ( VALUES ('AAA',2,'fred'), ('AAA',10,'bob') )
regress-# SELECT dat FROM dat ORDER BY dat;
     dat      
--------------
 (AAA,2,fred)
 (AAA,10,bob)
(2 rows)

as compared to the usual string collation ordering:

regress=# WITH dat AS ( VALUES ('AAA2fred'), ('AAA10bob') )
regress-# SELECT dat FROM dat ORDER BY dat;
    dat     
------------
 (AAA10bob)
 (AAA2fred)
(2 rows)

However, the record comparison approach doesn't generalize because Pg won't compare ROW(..) constructs or records of unequal numbers of entries.

Given the sample data in this SQLFiddle the default en_AU.UTF-8 collation produces the ordering:

1A, 10A, 2A, AAA10B, AAA11B, AAA1BB, AAA20B, AAA21B, X10C10, X10C2, X1C1, X1C10, X1C3, X1C30, X1C4, X2C1

but I want:

1A, 2A, 10A, AAA1BB, AAA10B, AAA11B, AAA20B, AAA21B, X1C1, X1C3, X1C4, X1C10, X1C30, X2C1, X10C10, X10C2

I'm working with PostgreSQL 9.1 at the moment, but 9.2-only suggestions would be fine. I'm interested in advice on how to achieve an efficient string-splitting method, and how to then compare the resulting split data in the alternating string-then-number collation described. Or, of course, on entirely different and better approaches that don't require splitting strings.

PostgreSQL doesn't seem to support comparator functions, otherwise this could be done fairly easily with a recursive comparator and something like ORDER USING comparator_fn and a comparator(text,text) function. Alas, that syntax is imaginary.

Update: Blog post on the topic.

Community
  • 1
  • 1
Craig Ringer
  • 307,061
  • 76
  • 688
  • 778
  • Note: IIRC, Linux has a "numsort" function to do this kind of thing. Without collation woes, of course. Intended for version strings. (could even stem from BSD ;-) – wildplasser Oct 18 '12 at 23:56
  • @wildplasser Interesting. I'm not seeing a manpage for it and there's no `\` in /usr/include, though, so I suspect it must have a somewhat different name. – Craig Ringer Oct 18 '12 at 23:57
  • It was from the top of my head. I'll do some googling. BRB. Could be implemented as a compariso-operator, BTW (without the array splitting) – wildplasser Oct 18 '12 at 23:59
  • 2
    **versionsort()** it is in the scandir man page. – wildplasser Oct 19 '12 at 00:09
  • 1
    @wildplasser Nice find. It looks like the actual function that does the work is [`strverscmp`](http://linux.about.com/library/cmd/blcmdl3_strverscmp.htm). So one possible approach is to write a new ordering operator as a C extension based on `strverscmp` - not impossible, but non-portable and probably not ideal. I'll put it on my "C extensions for fun" todo though (http://blog.ringerc.id.au/2012/08/generating-random-bytea-values-in.html). `strverscmp` doesn't respect the current collation, so it'd produce wrong results if the words were (say) polish, though. – Craig Ringer Oct 19 '12 at 00:11
  • I remembered because I tried to build a function myself (I like to construct state machines in my leasure time ;-) I could not find it here at home, maybe it is at work somewhere. It 'll be friday anyway tomorow. Good night! – wildplasser Oct 19 '12 at 00:14
  • There are some truly fascinating comments on that CodingHorror post (linked in question). For example, the Mac OS Finder's natural sort is provided by [the International Components for Unicode (ICU)](http://site.icu-project.org/) library using the `UCOL_NUMERIC_COLLATION` flag. – Craig Ringer Oct 19 '12 at 02:33

7 Answers7

19

Building on your test data, but this works with arbitrary data. This works with any number of elements in the string.

Register a composite type made up of one text and one integer value once per database. I call it ai:

CREATE TYPE ai AS (a text, i int);

The trick is to form an array of ai from each value in the column.

regexp_matches() with the pattern (\D*)(\d*) and the g option returns one row for every combination of letters and numbers. Plus one irrelevant dangling row with two empty strings '{"",""}' Filtering or suppressing it would just add cost. Aggregate this into an array, after replacing empty strings ('') with 0 in the integer component (as '' cannot be cast to integer).

NULL values sort first - or you have to special case them - or use the whole shebang in a STRICT function like @Craig proposes.

Postgres 9.4 or later

SELECT data
FROM   alnum
ORDER  BY ARRAY(SELECT ROW(x[1], CASE x[2] WHEN '' THEN '0' ELSE x[2] END)::ai
                FROM regexp_matches(data, '(\D*)(\d*)', 'g') x)
        , data;

db<>fiddle here

Postgres 9.1 (original answer)

Tested with PostgreSQL 9.1.5, where regexp_replace() had a slightly different behavior.

SELECT data
FROM  (
    SELECT ctid, data, regexp_matches(data, '(\D*)(\d*)', 'g') AS x
    FROM   alnum
    ) x
GROUP  BY ctid, data   -- ctid as stand-in for a missing pk
ORDER  BY regexp_replace (left(data, 1), '[0-9]', '0')
        , array_agg(ROW(x[1], CASE x[2] WHEN '' THEN '0' ELSE x[2] END)::ai)
        , data         -- for special case of trailing 0

Add regexp_replace (left(data, 1), '[1-9]', '0') as first ORDER BY item to take care of leading digits and empty strings.

If special characters like {}()"', can occur, you'd have to escape those accordingly.
@Craig's suggestion to use a ROW expression takes care of that.

BTW, this won't execute in sqlfiddle, but it does in my db cluster. JDBC is not up to it. sqlfiddle complains:

Method org.postgresql.jdbc3.Jdbc3Array.getArrayImpl(long,int,Map) is not yet implemented.

This has since been fixed: http://sqlfiddle.com/#!17/fad6e/1

Erwin Brandstetter
  • 605,456
  • 145
  • 1,078
  • 1,228
  • Funny; we were almost certainly working on these very similar approaches simultaneously. `\D` is much better than the character class I was using. – Craig Ringer Oct 19 '12 at 01:25
  • @CraigRinger: I guess, one has to end up somewhere near this solution (with plain SQL). So it's not that surprising that you were steering in the same direction. It *is* pretty advanced stuff, though. PostgreSQL art, kind of. :) – Erwin Brandstetter Oct 19 '12 at 01:29
  • 1
    Yeah, I managed to trigger that error too. It looks like PgJDBC doesn't like arrays of composite types. I'll add a test case to the PgJDBC test suite and TODO it on https://github.com/pgjdbc/pgjdbc/issues . Casting to `text` works around it. – Craig Ringer Oct 19 '12 at 01:46
  • Instead of the `replace` and `translate` it might be safer to use a ROW() constructor and array subscripting. See: http://sqlfiddle.com/#!12/b893e/8 – Craig Ringer Oct 19 '12 at 01:52
  • @CraigRinger: I updated with your idea to use a row expression, plus added `left(data, 1)` to fix the order for leading digits or empty strings. Only `NULL` values remain unsolved - they are filtered by the `regexp_matches()` step which produces "no row" for `NULL` values. You would have to `UNION ALL` `NULL` values back to the result if you want them in there. – Erwin Brandstetter Oct 19 '12 at 02:19
  • To handle NULLs I'd probably wrap it in a `STRICT IMMUTABLE` SQL function, as that'd not only handle NULLs automatically but also allow indexing on the expression. – Craig Ringer Oct 19 '12 at 02:22
  • @CraigRinger: `STRICT` would take care of it nicely - I see you are aiming for the index on an expression. In pure SQL you'd need the special case for `NULL`. BTW, there was still a glitch in the `ORDER BY` for leading numbers / empty strings. I think I silenced that for good now. – Erwin Brandstetter Oct 19 '12 at 02:47
  • Nice. Thanks; you don't need to go so all-out on this, but it's certainly interesting. Just posted about this topic btw: http://blog.ringerc.id.au/2012/10/natural-sorting-example-of-utility-of.html – Craig Ringer Oct 19 '12 at 02:48
15

I faced this same problem, and I wanted to wrap the solution in a function so I could re-use it easily. I created the following function to achieve a 'human style' sort order in Postgres.

CREATE OR REPLACE FUNCTION human_sort(text)
  RETURNS text[] AS
$BODY$   
  /* Split the input text into contiguous chunks where no numbers appear,
     and contiguous chunks of only numbers. For the numbers, add leading 
     zeros to 20 digits, so we can use one text array, but sort the 
     numbers as if they were big integers.

       For example, human_sort('Run 12 Miles') gives
            {'Run ', '00000000000000000012', ' Miles'}
  */
  select array_agg(
    case
      when a.match_array[1]::text is not null 
        then a.match_array[1]::text         
      else lpad(a.match_array[2]::text, 20::int, '0'::text)::text                                      
    end::text)
    from (
      select regexp_matches(
        case when $1 = '' then null else $1 end, E'(\\D+)|(\\d+)', 'g'
      ) AS match_array      
    ) AS a  
$BODY$
  LANGUAGE sql IMMUTABLE;

tested to work on Postgres 8.3.18 and 9.3.5

  • No recursion, should be faster than recursive solutions
  • Can be used in just the order by clause, don't have to deal with primary key or ctid
  • Works for any select (don't even need a PK or ctid)
  • Simpler than some other solutions, should be easier to extend and maintain
  • Suitable for use in a functional index to improve performance
  • Works on Postgres v8.3 or higher
  • Allows an unlimited number of text/number alternations in the input
  • Uses just one regex, should be faster than versions with multiple regexes
  • Numbers longer than 20 digits are ordered by their first 20 digits

Here's an example usage:

select * from (values 
  ('Books 1', 9),
  ('Book 20 Chapter 1', 8),
  ('Book 3 Suffix 1', 7),
  ('Book 3 Chapter 20', 6),
  ('Book 3 Chapter 2', 5),
  ('Book 3 Chapter 1', 4),
  ('Book 1 Chapter 20', 3),
  ('Book 1 Chapter 3', 2),
  ('Book 1 Chapter 1', 1),
  ('', 0),
  (null::text, 0)
) as a(name, sort)
order by human_sort(a.name)
-----------------------------
|name               |  sort |
-----------------------------
|                   |   0   |
|                   |   0   |
|Book 1 Chapter 1   |   1   |
|Book 1 Chapter 3   |   2   |
|Book 1 Chapter 20  |   3   |
|Book 3 Chapter 1   |   4   |
|Book 3 Chapter 2   |   5   |
|Book 3 Chapter 20  |   6   |
|Book 3 Suffix 1    |   7   |
|Book 20 Chapter 1  |   8   |
|Books 1            |   9   |
-----------------------------
TNelson
  • 176
  • 1
  • 4
  • +1 for contributing your useful input so clearly and helpfully. Thankyou. Now go upgrade! 8.3 is obsolete and it will soon be unsupported, with no security updates or bug fixes. It's also a pain to work with, missing all sorts of performance and usability features. – Craig Ringer Dec 19 '13 at 03:27
  • 1
    Tested and working on 10.4. This is the only non-extension solution I've found that works. As a plus, it requires no external types or tables and seems to work with any arbitrary data. – rintaun Sep 29 '18 at 03:18
  • Thanks rintaun! I have used this solution in production for the last 5 years and it's still going strong with no issues. – TNelson Oct 22 '18 at 15:37
  • Works perfectly – Kilisi Sep 28 '21 at 05:46
8

Adding this answer late because it looked like everyone else was unwrapping into arrays or some such. Seemed excessive.

CREATE FUNCTION rr(text,int) RETURNS text AS $$
SELECT regexp_replace(
    regexp_replace($1, '[0-9]+', repeat('0',$2) || '\&', 'g'), 
    '[0-9]*([0-9]{' || $2 || '})', 
    '\1', 
    'g'
)
$$ LANGUAGE sql;

SELECT t,rr(t,9) FROM mixed ORDER BY t;
      t       |             rr              
--------------+-----------------------------
 AAA02free    | AAA000000002free
 AAA10bob     | AAA000000010bob
 AAA2bbb03boo | AAA000000002bbb000000003boo
 AAA2bbb3baa  | AAA000000002bbb000000003baa
 AAA2fred     | AAA000000002fred
(5 rows)

(reverse-i-search)`OD': SELECT crypt('richpass','$2$08$aJ9ko0uKa^C1krIbdValZ.dUH8D0R0dj8mqte0Xw2FjImP5B86ugC');
richardh=> 
richardh=> SELECT t,rr(t,9) FROM mixed ORDER BY rr(t,9);
      t       |             rr              
--------------+-----------------------------
 AAA2bbb3baa  | AAA000000002bbb000000003baa
 AAA2bbb03boo | AAA000000002bbb000000003boo
 AAA2fred     | AAA000000002fred
 AAA02free    | AAA000000002free
 AAA10bob     | AAA000000010bob
(5 rows)

I'm not claiming two regexps are the most efficient way to do this, but rr() is immutable (for fixed length) so you can index it. Oh - this is 9.1

Of course, with plperl you could just evaluate the replacement to pad/trim it in one go. But then with perl you've always got just-one-more-option (TM) than any other approach :-)

Richard Huxton
  • 21,516
  • 3
  • 39
  • 51
5

The following function splits a string into an array of (word,number) pairs of arbitrary length. If the string begins with a number then the first entry will have a NULL word.

CREATE TYPE alnumpair AS (wordpart text,numpart integer);

CREATE OR REPLACE FUNCTION regexp_split_numstring_depth_pairs(instr text)
RETURNS alnumpair[] AS $$
WITH x(match) AS (SELECT regexp_matches($1, '(\D*)(\d+)(.*)'))
SELECT
  ARRAY[(CASE WHEN match[1] = '' THEN '0' ELSE match[1] END, match[2])::alnumpair] || (CASE 
  WHEN match[3] = '' THEN
    ARRAY[]::alnumpair[]
  ELSE 
    regexp_split_numstring_depth_pairs(match[3]) 
  END)
FROM x;$$ LANGUAGE 'sql' IMMUTABLE;

allowing PostgreSQL's composite type sorting to come into play:

SELECT data FROM alnum ORDER BY regexp_split_numstring_depth_pairs(data);

and producing the expected result, as per this SQLFiddle. I've adopted Erwin's substitution of 0 for the empty string in all strings beginning with a number so that numbers sort first; it's cleaner than using ORDER BY left(data,1), regexp_split_numstring_depth_pairs(data).

While the function is probably horrifically slow it can at least be used in an expression index.

That was fun!

Craig Ringer
  • 307,061
  • 76
  • 688
  • 778
  • 1
    I made numbers sort first by substituting `0` instead of `NULL`, which sorts first within numbers and letters alike, at least in all collations known to me. – Erwin Brandstetter Oct 19 '12 at 01:41
  • @ErwinBrandstetter Good idea using `0`. – Craig Ringer Oct 19 '12 at 02:01
  • @CraigRinger I plan on using this function for one of my production servers, would it be a good idea. Also what would be the best way to index the colum. I need this for only 1 table having a varchar(250) column – Akash Feb 16 '13 at 17:43
  • @Akash If at all possible, fix your data model so that this isn't necessary. If you can't, then you could create a functional index on the column and then `ORDER BY` that function expression. Performance ... well, try it and see. – Craig Ringer Feb 17 '13 at 08:25
3
create table dat(val text)
insert into dat ( VALUES ('BBB0adam'), ('AAA10fred'), ('AAA2fred'), ('AAA2bob') );

select 
  array_agg( case when z.x[1] ~ E'\\d' then lpad(z.x[1],10,'0') else z.x[1] end ) alnum_key
from (
  SELECT ctid, regexp_matches(dat.val, E'(\\D+|\\d+)','g') as x
  from dat
) z
group by z.ctid
order by alnum_key;

       alnum_key       
-----------------------
 {AAA,0000000002,bob}
 {AAA,0000000002,fred}
 {AAA,0000000010,fred}
 {BBB,0000000000,adam}

Worked on this for almost an hour and posted without looking -- I see Erwin arrived at a similar place. Ran into the same "could not find array type for data type text[]" trouble as @Clodoaldo. Had a lot of trouble getting the cleanup exercise to not agg all the rows until I thought of grouping by the ctid (which feels like cheating really -- and doesn't work on a psuedo table as in the OP example WITH dat AS ( VALUES ('AAA2fred'), ('AAA10bob') ) ...). It would be nicer if array_agg could accept a set-producing subselect.

dbenhur
  • 20,008
  • 4
  • 48
  • 45
  • Thanks for taking a look. It's an interesting problem, and I agree that the array_agg solution is a challenge. – Craig Ringer Oct 19 '12 at 02:28
  • +1 This is nice. There's the one remaining issue that the maximum number of digits is hard-coded. Then again, `integer` errors out after 10 digits, too. But you can just move on to `bigint` or `numeric` with that. – Erwin Brandstetter Oct 19 '12 at 03:23
  • 1
    Not sure why the array_agg here - you can do this with just regexps - see my answer below. – Richard Huxton Oct 19 '12 at 09:47
  • @RichardHuxton why? well, your regexp_replace approach is clever, but I find it more obscure to read and I'll bet you a few bucks it's computationally more expensive as well. – dbenhur Oct 19 '12 at 21:15
2

I'm not a RegEx guru, but I can work it to some extent. Enough to produce this answer.

It will handle up to 2 numeric values within the content. I don't think OSX goes further than that, if it even handles 2.

WITH parted AS (
  select data,
         substring(data from '([A-Za-z]+).*') part1,
         substring('a'||data from '[A-Za-z]+([0-9]+).*') part2,
         substring('a'||data from '[A-Za-z]+[0-9]+([A-Za-z]+).*') part3,
         substring('a'||data from '[A-Za-z]+[0-9]+[A-Za-z]+([0-9]+).*') part4
    from alnum
)
  select data
    from parted
order by part1,
         cast(part2 as int),
         part3,
         cast(part4 as int),
         data;

SQLFiddle

RichardTheKiwi
  • 105,798
  • 26
  • 196
  • 262
  • Yep, the challenge is to generalize it to strings of arbitrary numbers of word-number alternations. For any given fixed number it's possible to do by writing individual splitting operations, albeit inefficiently. Thanks, though. – Craig Ringer Oct 19 '12 at 00:16
  • I was going to go for "any number of alternations" using regexp_matches and it will produce them in rows which can be concatenated back for sorting. If regexp_matches produced 2 columns, one with the sequence number, this would be a lot easier. – RichardTheKiwi Oct 19 '12 at 00:18
  • Ooh, that's an interesting idea. Use a recursive CTE based on matching of `(\w*)(\d+)(.*)` on the remainder of the last match. – Craig Ringer Oct 19 '12 at 00:20
  • The other thing about generalising it is that you'll be building a solution waiting for a problem. I can't imagine a use case for having to sort on a 7-part alpha-numeric alternating string. – RichardTheKiwi Oct 19 '12 at 00:20
  • Nor can I, frankly, I'm posting this because (a) it's an interesting problem and (b) it'll help Sivaram out by directing attention and less generalized solutions. – Craig Ringer Oct 19 '12 at 00:23
  • `it's an interesting problem` Smells like Code Golf and a cue for `"Close Question"` – RichardTheKiwi Oct 19 '12 at 00:28
  • Fair point; it's certainly borderline. OTOH, I do have an application for the sort, and I'm reluctant to go for one that just craps out after an arbitrary number of alternations. – Craig Ringer Oct 19 '12 at 00:33
1

The following solution is a combination of various ideas presented in other answers, as well as some ideas from the classic solution:

create function natsort(s text) returns text immutable language sql as $$
  select string_agg(r[1] || E'\x01' || lpad(r[2], 20, '0'), '')
  from regexp_matches(s, '(\D*)(\d*)', 'g') r;
$$;

The design goals of this function were simplicity and pure string operations (no custom types and no arrays), so it can easily be used as a drop-in solution, and is trivial to be indexed over.

Note: If you expect numbers with more than 20 digits, you'll have to replace the hard-coded maximum length 20 in the function with a suitable larger length. Note that this will directly affect the length of the resulting strings, so don't make that value larger than needed.

vog
  • 23,517
  • 11
  • 59
  • 75