15

I am sorting songs in SQLite (on Android). I want to order them:

  1. Case-insensitive
  2. With leading-digits at the end, by integer value.
  3. Without punctuation (e.g. parentheses, periods, hyphens, apostrophes)

I have 1 & 2 working (see below). However, I can't figure out how to replace every character (other than letters, numbers, and spaces) other than to call replace() for each character.

Is there a way to do this other than ~32 calls to replace()?
(ASCII values 33-47,58-64,91-96,123-126)


Here is a test table. The value 'n' should ideally come out in order. (No, you cannot order by n ;)

create table songs (n integer, name text);
insert into songs (n,name) values (6,'I''ll Be That Girl');
insert into songs (n,name) values (24,'1969');
insert into songs (n,name) values (9,'La Moldau');
insert into songs (n,name) values (20,'Pule');
insert into songs (n,name) values (7,'I''m a Rainbow Too');
insert into songs (n,name) values (21,'5 Years');
insert into songs (n,name) values (18,'Pressure');
insert into songs (n,name) values (13,'Lagan');
insert into songs (n,name) values (1,'any old wind that blows');
insert into songs (n,name) values (17,'Poles Apart');
insert into songs (n,name) values (8,'Imagine');
insert into songs (n,name) values (14,'Last Stop before Heaven');
insert into songs (n,name) values (3,'I Before E Except After C');
insert into songs (n,name) values (4,'i do, i do, i do');
insert into songs (n,name) values (22,'99 Luftballons');
insert into songs (n,name) values (12,'L''accord parfait');
insert into songs (n,name) values (15,'Pluto');
insert into songs (n,name) values (19,'The Promise');
insert into songs (n,name) values (2,'(Don''t Fear) The Reaper');
insert into songs (n,name) values (10,'L.A. Nights');
insert into songs (n,name) values (23,'911 is a Joke');
insert into songs (n,name) values (5,'Ichthyosaurs Are Awesome');
insert into songs (n,name) values (11,'Labradors are Lovely');
insert into songs (n,name) values (16,'P.O.D.-Boom');

Here's the solution to just 1 & 2 above:

SELECT n
FROM songs
ORDER BY
  CASE WHEN name GLOB '[0-9]*' THEN 1
       ELSE 0
  END,
  CASE WHEN name GLOB '[0-9]*' THEN CAST(name AS INT)
       ELSE name
  END
COLLATE NOCASE

For this test set it produces results in this order: 2,1,3,4,6,7,5,8,12,10,9,11,13,14,16,15,17,18,20,19,21,22,23,24

I can fix this particular test set with manual replaces for each undesired character:

SELECT n
FROM songs
ORDER BY
  CASE WHEN name GLOB '[0-9]*' THEN 1
       ELSE 0
  END,
  CASE WHEN name GLOB '[0-9]*' THEN CAST(name AS INT)
       ELSE
         replace(
           replace(
             replace(
               replace(name,'.',''),
               '(',''
             ),
             '''',''
           ),
           '  ',' '
         )
  END
COLLATE NOCASE
Phrogz
  • 296,393
  • 112
  • 651
  • 745
  • In case it helps, I can rely on SQLite 3.8.6 or better. Currently I'm only targeting Android L, and will soon be targeting Android M. – Phrogz Sep 17 '15 at 17:54
  • @Phrogz..why don't you try taking the difference of the original string length and the length of string with punctuation replaced by `''`(empty string) and `order` it by that difference for no.3 in the question? – Vamsi Prabhala Sep 17 '15 at 20:13
  • i don't think you got me.this is what i was trying to convey. for a string 'ab,c' it would be len(orig_string) is 4 and len(orig_string_with_punctuation replaced by '') is 3. So this way you get the diff as `1`. And for strings without punctuation this diff would be 0. So you could use these differences in the `order by` clause. hope you get me. – Vamsi Prabhala Sep 17 '15 at 20:24
  • @vkp I understand you, but I don't understand how this would help. Perhaps I wrote requirement #3 badly; you may be misunderstanding what I want. As shown above, I want `L'accord` to sort between `Labradors` and `Lagan`. How would having `diff:1` for the first and `diff:0` for the others help that need? – Phrogz Sep 17 '15 at 20:26
  • i see..i misread the question. – Vamsi Prabhala Sep 17 '15 at 20:32
  • 5
    When you store your data, make a cleaned sort-key. Then sort by that on extraction of data. That will make your life so much simpler. – Allan S. Hansen Jan 15 '16 at 07:01
  • I know this is a workaround, but consider adding that `n` column with gaps that suit your need so that you don't need to recalculate existing values when inserting new one. – Kamil Gosciminski Jan 15 '16 at 09:28
  • Are your sure that you need this sorting to be done in sqlite / SQL itself? It can be done much easier in nearly all higher level programming languages or by extending SQL with [handwritten functions](https://www.sqlite.org/c3ref/create_function.html). – jofel Jan 15 '16 at 11:17
  • 1
    @AllanS.Hansen This is a sqlite query on data extracted live by the media player from the mp3 id3 tags. The database cannot be modified. – Phrogz Jan 15 '16 at 14:01
  • @jofel A reasonable comment in general. No, it does not have to be in sqlite. I hoped you do it there for complex reasons. Add for custom functions, see http://stackoverflow.com/a/8283265 – Phrogz Jan 15 '16 at 14:10

5 Answers5

5

I would add an additional column in the table, called "SortingName" or something. Calculate this value when inserting, ideally not in SQL but in a higher level language where you have all these nice string operations.

I didn't really understand this thing with the number. I guess the simplest thing you can do is extract the number before insert and put it into another column, like "SortingNumber".

Then simply sort like this:

Order By
  SortingName,
  SortingNumber

(Or the other way around.)

Another advantage is performance. You usually read data much more often then you write it. You can even create indexes on these two sorting columns, which is usually not possible if you calculate it in the query.

Stefan Steinegger
  • 63,782
  • 15
  • 129
  • 193
  • 1
    Sorry, the database is read-only from my perspective. The table is generated dynamically from the metadata of songs found. It's possible that I may be able to add a second table with this information, but then I need to worry about the tables getting out of sync as new songs become available or disappear. – Phrogz Jan 15 '16 at 14:04
4

The first solution (when DB and application can be modified):

Add to your table single column e.g. solumntForSorting. Then on your application before inserting, concatenate your second condition ("With leading-digits at the end, by integer value.") as 0 or 1 to song name which first 'was cleaned' from undesired symbols. So on solumntForSorting you will get something like this: 0Im a Rainbow Too and 1911 is a Joke.

The second solution (when only application can be modified):

If you have to sort data excluding some symbols and you are not allowed to change your DB, you will get a slower selection because of filtering undesired values. Most of the overhead will be at CPU time and memory.

Using replace function is tedious from my point of view, that is why I suggest using CTE with list of values you want to drop, like this ('.', '.', ';', '(', ')', '''', '-'). CTE will be bulky like multiple replace but it is easier to modify and maintain.

Try this solution:

 WITH RECURSIVE 
 ordering_name_substr(len, name, subsstr, hex_subsstr, number) 
 AS (SELECT  length(name), name, substr(name, 1, 1), hex(substr(name, 1, 1)), 1  
       FROM songs
      UNION ALL 
     SELECT len, name, substr(name, number + 1, 1),
            hex(substr(name, number + 1, 1)), number + 1
       FROM ordering_name_substr WHERE number < len),
 last_order_cretaria(value, old_name)
  AS (select GROUP_CONCAT(subsstr, ''), name 
           from ordering_name_substr 
        where hex_subsstr not in
       ('28', '29', '2C', '2E', '27') group by name )

SELECT S.n, S.name
FROM songs AS S LEFT JOIN last_order_cretaria AS OC
ON S.name = OC.old_name
ORDER BY
  CASE WHEN name GLOB '[0-9]*' THEN 1
       ELSE 0
  END,
  CASE WHEN name GLOB '[0-9]*' THEN CAST(name AS INT)
       ELSE
         OC.value
  END
COLLATE NOCASE

I have tested on sqlfiddle.

In the list ('28', '29', '2C', '2E', '27') you have values of ASCII codes (in hex) which you want to escape from be considered in ordering.

You also can try to use values itself like: ('.', '.', ';', '(', ')', '''', '-').

WITH RECURSIVE 
 ordering_name_substr(len, name, subsstr, number) 
 AS (SELECT length(name), name, substr(name, 1, 1), 1  
       FROM songs
      UNION ALL 
     SELECT len, name, substr(name, number + 1, 1),
            number + 1
       FROM ordering_name_substr WHERE number < len),
 last_order_cretaria(value, old_name)
  AS (select GROUP_CONCAT(subsstr, ''), name 
           from ordering_name_substr 
        where subsstr not in
       ('.', '.', ';', '(', ')', '''', '-') group by name )

SELECT S.n, S.name
FROM songs AS S LEFT JOIN last_order_cretaria AS OC
ON S.name = OC.old_name
ORDER BY
  CASE WHEN name GLOB '[0-9]*' THEN 1
       ELSE 0
  END,
  CASE WHEN name GLOB '[0-9]*' THEN CAST(name AS INT)
       ELSE
         OC.value
  END
COLLATE NOCASE

To make this sorting work fast and simple you have to be able to change your DB and application.

Mikhailov Valentin
  • 1,092
  • 3
  • 16
  • 23
3

In my opinion, the highest performance approach is to create a trigger to fill a new field named sort_key. You will need a primary key.

CREATE TABLE songs (n INTEGER, name TEXT, 
                    sort_key TEXT, 
                    ID INTEGER PRIMARY KEY AUTOINCREMENT);

CREATE TRIGGER songs_key_trigger
    AFTER INSERT ON songs FOR EACH ROW
    BEGIN n
        Declare @sort_key as varchar(255)
        -- calculate and call here your slugify function
        -- to fill sort_key from 'new.n' and 'new.name'
        UPDATE songs 
          SET sort_key = @sort_key
          WHERE ID = new.ID;
    END

Realize that this approach is index friendly, you can create an index over new column to avoid table full scan operations.

dani herrera
  • 48,760
  • 8
  • 117
  • 177
2

If you're allowed to create functions, this is what I'd create (taken from How to strip all non-alphabetic characters from string in SQL Server? and modified a bit):

Create Function [dbo].[RemoveNonAlphaNumericCharacters](@Temp VarChar(1000))
Returns VarChar(1000)
AS
Begin

    Declare @KeepValues as varchar(50)
    Set @KeepValues = '%[^a-zA-Z0-9\s]%'
    While PatIndex(@KeepValues, @Temp) > 0
        Set @Temp = Stuff(@Temp, PatIndex(@KeepValues, @Temp), 1, '')

    Return @Temp
End

This would meet your #3 requirement and strip all the junk out of your string, then your query would look like this:

SELECT n
FROM songs
ORDER BY
  CASE WHEN [dbo].[RemoveNonAlphaNumericCharacters](name) GLOB '[0-9]*' THEN 1
       ELSE 0
  END,
  CASE WHEN [dbo].[RemoveNonAlphaNumericCharacters](name) GLOB '[0-9]*' THEN CAST(name AS INT)
       ELSE [dbo].[RemoveNonAlphaNumericCharacters](name)
  END
COLLATE NOCASE

It doesn't look pretty and might not have best performance. I'd probably do, what Stefan suggested. Parse your song names and insert trimmed ones into a separate column just for ordering (And of course have index on that column). It should be best solution.

Community
  • 1
  • 1
Evaldas Buinauskas
  • 13,739
  • 11
  • 55
  • 107
  • Thanks for the idea. I don't think this is possible since this isn't my database, but one provided by the media manager. http://stackoverflow.com/a/8283265 – Phrogz Jan 15 '16 at 14:08
  • @Phrogz I should've been more careful when looking at tagged products. I guess there no other real way, rather than stripping all the junk out manually with 30+ replaces. – Evaldas Buinauskas Jan 15 '16 at 14:33
2

You can use the sqlite3 Android NDK Bindings to gain access to the full sqlite3 c API by using JNI calls.

Then you can Define New Collating Sequences by using the sqlite3_create_collation_v2() and related functions.

This approach does not change the database, as the collation is only overridden on the current database connection. So it satisfies that requirement in that it works if the database is read-only.

Notice I say you can. I'm not saying you SHOULD! Weigh the pros and cons of this approach as in most cases it is probably not worth the extra effort.

user700390
  • 2,287
  • 1
  • 19
  • 26