1

Is it possible to order the results of a PostgreSQL query by a title field that contains characters like [](),; etc but do so ignoring these punctuation characters and sorting only by the text characters?

I've read articles on changing the database collation or locale but have not found any clear instructions on how to do this on an existing database an on a per-column basis. Is this even possible?

Erwin Brandstetter
  • 605,456
  • 145
  • 1,078
  • 1,228
tufelkinder
  • 1,176
  • 1
  • 15
  • 37

2 Answers2

5

"Normalize" for sorting

You could use regexp_replace() with the pattern '[^a-zA-Z]' in the ORDER BY clause but that only recognizes pure ASCII letters. Better use the class shorthand '\W' which recognizes additional non-ASCII letters in your locale like äüóèß etc. Or you could improvise and "normalize all characters with diacritic elements to their base form with the help of the unaccent() function. Consider this little demo:

SELECT *
      , regexp_replace(x, '[^a-zA-Z]', '', 'g')
      , regexp_replace(x, '\W', '', 'g')
      , regexp_replace(unaccent(x), '\W', '', 'g')
FROM  (
SELECT 'XY ÖÜÄöüäĆČćč€ĞğīїıŁłŃńŇňŐőōŘřŠšŞşůŽžż‘´’„“”­–—[](),;.:̈� XY'::text AS x) t

->SQLfiddle for Postgres 9.2.
->SQLfiddle for Postgres 9.1.

Regular expression code has been updated in version 9.2. I am assuming this is the reason for the improved handling in 9.2 where all letter characters in the example are matched, while 9.1 only matches some.

unaccent() is provided by the additional module unaccent. Run:

CREATE EXTENSION unaccent;

once per database to use in (Postgres 9.1+, older versions use a different technique).

locales / collation

You must be aware that Postgres relies on the underlying operating system for locales (including collation). The sort order is governed by your chosen locale, or more specific LC_COLLATE. More in this related answer:
String sort order (LC_COLLATE and LC_CTYPE)

There are plans to incorporate collation support into Postgres directly, but that's not available at this time.

Many locales ignore the special characters you describe for sorting character data out of the box. If you have a locale installed in your system that provides the sort order you are looking for, you can use it ad-hoc in Postgres 9.1 or later:

SELECT foo FROM bar ORDER BY foo COLLATE "xy_XY"

To see which collations are installed and available in your current Postgres installation:

SELECT * FROM pg_collation;

Unfortunately it is not possible to define your own custom collation (yet) unless you hack the source code.

The collation rules are usually governed by the rules of a language as spoken in a country. The sort order telephone books would be in, if there were still telephone books ... Your operating system provides them.

For instance, in Debian Linux you can use:

locale -a

to display all generated locales. And:

dpkg-reconfigure locales

as root user (one way of several) to generate / install more.

Community
  • 1
  • 1
Erwin Brandstetter
  • 605,456
  • 145
  • 1,078
  • 1,228
  • This is all good information, but I have read various articles saying that there are different collations (though this is the first time I have seen how easy it is to apply one to a query), but how do I know which collation to use in a query in order to achieve the desired result? The collation support article does not provide a list of collations and how sorting behaves under each. – tufelkinder Jul 01 '13 at 19:33
  • @tufelkinder: I added some more to my answer. – Erwin Brandstetter Jul 01 '13 at 19:46
  • I appreciate that. The query does provide a listing of the available locales, but that doesn't help me determine which locale will produce which type of sorting, unless I'm missing something? – tufelkinder Jul 03 '13 at 18:22
  • @tufelkinder: Sorting rules are complex and could hardly be described in a short comment. Every country / language combination has their own set of rules. You would typically find them in a national set of grammer rules. – Erwin Brandstetter Jul 04 '13 at 12:54
1

If you want to have this ordering in one particular query you can

ORDER BY regexp_replace(title, '[^a-zA-Z]', '', 'g')

It will delete all non A-Z from sting and order by resulting field.

Ihor Romanchenko
  • 26,995
  • 8
  • 48
  • 44
  • This is really helpful and definitely getting my results in the right direction. Now to implement it in Django... – tufelkinder Jul 01 '13 at 19:16