2

I need to sort a table of apartment numbers in SQL. For those unfamiliar, these are not necessarily numbers, but numerical ordering needs to be applied as though they were.

For example, a possible set of apartment numbers would be ordered as-follows:

1
10
101
101-A
1000
200
200A
2000
C
D
E-100
F

Doing ORDER BY CONVERT( int, ApartmentNumber ) wouldn't work because not all apartment-numbers are string-encoded decimal integers. Similarly doing ORDER BY ApartmentNumber wouldn't work because it would place 101 after 1000.

The other QAs on StackOverflow generally concern themselves with either a known fixed-format of value Sorting string column containing numbers in SQL? or doing the sort with error-handling: Sorting field by numerical value and lexicographical value

In my own project, a previous developer used this trick:

ORDER BY
    RIGHT('00000000000000000000' + RTRIM( ApartmentNumber ), 20 ) ASC

...which feels even worse. Thing is: this trick is algorithmically sound - it just feels like a hack to have to have the database engine perform string allocations (multiple times too, if they don't optimize the concatenation and RTRIM into a single string operation).

Another approach proffered on SO is:

ORDER BY
    LEN( ApartmentNumber ),
    ApartmentNumber

...however this yields this incorrect ordering of the input set:

1
C
D
F
10
101
200
1000
2000
200A
101-A
E-100

I'm using SQL Server 2012 locally (in 2008 (Level 100) compatibility mode) and this application will be deployed to an Azure SQL server (Azure SQL V12).

UPDATE:

I've been considering some options, I think the "best" is to use a fixed-length char(10) field instead of varchar(10) and to ensure that the content of the field is always left-aligned, that way the sort-order will be ensured with a simple ORDER BY ApartmentNumber.

UPDATE 2: I realised the above idea (char(10)) doesn't solve the problem of 200A needing to be sorted before 2000. I think the best solution would be to normalize the values to a uniform integer representation and storing that int value in the database. The algorithm to do the conversion would be best (i.e. most succinctly) if it were not written in SQL.

Community
  • 1
  • 1
Dai
  • 141,631
  • 28
  • 261
  • 374
  • 2
    which dbms are you using? – Vamsi Prabhala Aug 22 '16 at 00:07
  • @vkp I was hoping there might be a good cross-platform solution, but I'm targeting SQL Server 2012 and SQL Azure V12. – Dai Aug 22 '16 at 00:09
  • Perhaps you need to come up with a set of masks that match up with all the various formats. – shawnt00 Aug 22 '16 at 00:13
  • @shawnt00 Looking at the text - I see what it should really do is identify the first numeric value in the string, sort by that, and then do a lexicographical sort. – Dai Aug 22 '16 at 00:33
  • Given you're using SQL Server, are you able to use CLR stored procedures? – jdphenix Aug 22 '16 at 00:34
  • What about E101 and E1000? I think you'd have to check for both leading and trailing numeric sections. That's why I figured it would be easier to just list out the expected formats. – shawnt00 Aug 22 '16 at 00:35
  • @shawnt00 I think the most performant way would be to pre-process the ApartmentNumber to some kind of normalized integer representation (e.g. as a base-36 number - ignoring non-alphanumeric chars) and store that in the DB, and sort by that in future - at the cost of needing an additional integer column. – Dai Aug 22 '16 at 00:37
  • The first solution you mention (pre pad with zeros) is not a hack, it's exactly how I would do it and have done it in the past. Don't concern yourself worrying about how the DB does it's string operations. Add a computed column that works this out beforehand and use it. Only thing is - will it work for 200 and 200A? I don't think so – Nick.Mc Aug 22 '16 at 00:39

1 Answers1

0

I don't think my original idea of using simple masking patterns was going to work well and I ended up reducing the problem to these four expressions. Hopefully you can find some value in it all.

order by
    case
         when patindex('%[0-9][A-Z-]%', apt) > 0
           then cast(substring(apt, 1, patindex('%[0-9][A-Z-]%', apt)) as int)
         when patindex('[0-9]%', apt) = 1
           then cast(apt as int)
         else 99999
    end /* numeric_prefix */,
    case
         when patindex('%[A-Z][0-9-]%', apt) > 0
           then left(apt, patindex('%[A-Z][0-9-]%', apt))
         when patindex('[A-Z]%', apt) = 1
           then apt
         else ''
    end /* char_prefix */,
    case 
         when patindex('%[A-Z-][0-9]%', apt) > 0
           then cast(substring(apt, patindex('%[A-Z-][0-9]%', apt) + 1, 10) as int)
         else 0
    end /* numeric_suffix */,
    replace(apt, '-', '') /* stripped_hyphen */,
    case when charindex('-', apt) = 0 then 0 else 1 end /* sort_hyphen_last */

http://rextester.com/MFYVN38156

Here's an single expression that you could store and use for sorting:

case
     when patindex('%[0-9][A-Z-]%', apt) > 0
     then right('*****' + left(apt, patindex('%[0-9][A-Z-]%', apt)), 5) +
          right('>>>>>' +
              replace(substring(apt, patindex('%[0-9][A-Z-]%', apt) + 1, 10), '-', ''), 5)
     when patindex('%[A-Z][0-9-]%', apt) > 0
     then right('>>>>>' + left(apt, patindex('%[A-Z][0-9-]%', apt)), 5) +
          right('*****' +
              replace(substring(apt, patindex('%[A-Z][0-9-]%', apt) + 1, 10), '-', ''), 5)
     when patindex('[0-9]%', apt) = 1 then right('*****' + apt, 5) + '>>>>>'
     when patindex('[A-Z]%', apt) = 1 then right('>>>>>' + apt, 5) + '*****'
     else '>>>>>*****'
end +
case when charindex('-', apt) = 0 then '' else '-' end /* collate Latin1_General_BIN */

> was chosen as a character that sorts between the digits and alphabetic characters. * falls before the digits. The output looks like this:

   apt     sort_string  
 -------- ------------- 
  1        ****1>>>>>   
  10       ***10>>>>>   
  101      **101>>>>>   
  101-A    **101>>>>A-  
  200      **200>>>>>   
  200A     **200>>>>A   
  200-A    **200>>>>A-  
  1000     *1000>>>>>   
  2000     *2000>>>>>   
  C        >>>>C*****   
  D        >>>>D*****   
  E        >>>>E*****   
  E10      >>>>E***10   
  E100     >>>>E**100   
  E-100    >>>>E**100-  
  E101     >>>>E**101   
  E-101    >>>>E**101-  
  E200     >>>>E**200   
  E-200    >>>>E**200-  
  E1000    >>>>E*1000   
  E-1001   >>>>E*1001-  
  F        >>>>F*****
shawnt00
  • 16,443
  • 3
  • 17
  • 22
  • I'm curious what the runtime performance of this is. I note that if the individual `WHEN` cases are evaluated sequentially (instead of in-parallel) then they could be rearranged into a tree to avoid unnecessary evaluation. – Dai Aug 22 '16 at 00:55
  • How often are you going to be running this whole operation? Is performance really something you need to worry much about here? – shawnt00 Aug 22 '16 at 02:02
  • This solution doesn't place negative numbers before positive numbers. This could be fixed by replacing the final line with `case when charindex('-', apt) = 0 then '+' else '-' end /* collate Latin1_General_BIN */` – WonderWorker Jun 14 '17 at 14:49
  • Were we expecting negative apartment numbers? – shawnt00 Jun 14 '17 at 15:14