7

I have a table with book titles and I want to select books that have title matching a regexp and to order results by the position of the regexp match in title.

It's easy for a single-word searches. E.g.

TABLE book
id   title
1    The Sun
2    The Dead Sun
3    Sun Kissed

I'm going to put .* between words in client's search term before sending query to DB, so I'd write SQL with prepared regexps here.

SELECT book.id, book.title FROM book
    WHERE book.title ~* '.*sun.*'
    ORDER BY COALESCE(NULLIF(position('sun' in book.title), 0), 999999) ASC;

RESULT
id   title
3    Sun Kissed
1    The Sun
2    The Dead Sun

But if search term has more than one word I want to match titles that have all words from search term with anything between them, and sort by the position like before, so I need a function that returns a position of regexp, I didn't find an appropriate one in official PostgreSQL docs.

TABLE books
id   title
4    Deep Space Endeavor
5    Star Trek: Deep Space Nine: The Never Ending Sacrifice
6    Deep Black: Space Espionage and National Security

SELECT book.id, book.title FROM book
    WHERE book.title ~* '.*deep.*space.*'
    ORDER BY ???REGEXP_POSITION_FUNCTION???('.*deep.*space.*' in book.title);

DESIRED RESULT
id   title
4    Deep Space Endeavor
6    Deep Black: Space Espionage and National Security
5    Star Trek: Deep Space Nine: The Never Ending Sacrifice

I didn't find any function similar to ???REGEXP_POSITION_FUNCTION???, do you have any ideas?

wobmene
  • 1,108
  • 9
  • 14
  • Why is row 4 before row 6 in you last result? Both match at position 0. Arbitrary? Or are there more criteria? – Erwin Brandstetter Jan 14 '14 at 01:23
  • @ErwinBrandstetter Ideally I'd like longer (more characters) matches to have less "rank", but for now it's not the main problem. – wobmene Jan 14 '14 at 01:30

2 Answers2

4

One way (of many) to do this: Remove the rest of the string beginning at the match and measure the length of the truncated string:

SELECT id, title
FROM   book
WHERE  title ILIKE '%deep%space%'
ORDER  BY length(regexp_replace(title, 'deep.*space.*', '','i'));

Using ILIKE in the WHERE clause, since that is typically faster (and does the same here).
Also note the fourth parameter to the regexp_replace() function ('i'), to make it case insensitive.

Alternatives

As per request in the comment.
At the same time demonstrating how to sort matches first (and NULLS LAST).

SELECT id, title
      ,substring(title FROM '(?i)(^.*)deep.*space.*') AS sub1
      ,length(substring(title FROM '(?i)(^.*)deep.*space.*')) AS pos1

      ,substring(title FROM '(?i)^.*(?=deep.*space.*)') AS sub2
      ,length(substring(title FROM '(?i)^.*(?=deep.*space.*)')) AS pos2

      ,substring(title FROM '(?i)^.*(deep.*space.*)') AS sub3
      ,position((substring(title FROM '(?i)^.*(deep.*space.*)')) IN title) AS p3

      ,regexp_replace(title, 'deep.*space.*', '','i') AS reg4
      ,length(regexp_replace(title, 'deep.*space.*', '','i')) AS pos4
FROM   book
ORDER  BY title ILIKE '%deep%space%' DESC NULLS LAST
         ,length(regexp_replace(title, 'deep.*space.*', '','i'));

You can find documentation for all of the above in the manual here and here.

-> SQLfiddle demonstrating all.

Community
  • 1
  • 1
Erwin Brandstetter
  • 605,456
  • 145
  • 1,078
  • 1,228
  • Nice! Really simple and good solution. You say there can be many other solutions, could you please show me any? I didn't see this solution and can't imagine others even now :) – wobmene Jan 14 '14 at 01:33
  • How can I remove WHERE clause and have the same sorting? I mean I want to get all titles but titles having no match at the end? With this length() I will get very short titles having no match first. – wobmene Jan 14 '14 at 01:44
  • I just edited the question because I've found that I don't have ability to create procedural functions at all (thanks heroku). – wobmene Jan 14 '14 at 01:50
  • `ORDER BY title ILIKE '%deep%space%' DESC, length(regexp_replace(title, 'deep.*space.*', '','i'))` looks like a solution – wobmene Jan 14 '14 at 01:52
  • @khrf: Yes to your last comment - solution for sorting. I would add `NULLS LAST` if your column isn't defined `NOT NULL`. And I added a few alternatives .. – Erwin Brandstetter Jan 14 '14 at 02:03
  • @khrf: If that solved your problem, please accept the answer. – Erwin Brandstetter Jan 16 '14 at 02:10
  • 1
    Yes, your answer helped a lot. Finally I'm taking the length of string before match and the length of the match, and ordering based on this two values, Thank You very much for inspiration! – wobmene Jan 17 '14 at 20:06
2

Another way to do this would be to first get the literal match for the pattern, then find the position of the literal match:

strpos(input, (regexp_match(input, pattern, 'i'))[1]);

Or in this case:

SELECT   id, title
FROM     book
ORDER BY strpos(book.title, (regexp_match(book.title, '.*deep.*space.*', 'i'))[1]);

However, there are few caveats:

  1. this is not very efficient as it will scan the input string twice.

  2. this will ignore lookaround (lookbehind, lookahead) constraints, since the literal match can appear multiple times, before the pattern match. e.g: for the input 'aba' and pattern '(?<=b)a', strpos will return 1 (for the 1st 'a') although the actual position should be 3 (for the 2nd 'a').

BTW, you should probably use a greedy quantifier and narrow your character class as much as you can instead of .* to increase performance (e.g 'deep [\w\s]*? space')

gilad905
  • 2,842
  • 2
  • 16
  • 23