0

I'm trying to match a comma-separated string of numbers to a certain pattern within an sql query. I used regular expressions for similar problems in the past successfully, so I'm trying to get them working here as well. The problem is as follows:

  • The string may contain any number in a range (e.g. 1-4) exactly 0-1 times.
  • Two numbers are comma-separated
  • The numbers have to be in ascending order

(I think this is kind of a case of ordered sampling without replacement)
Sticking with the example of 1-4, the following entries should match:

1 
1,2 
1,3 
1,4 
1,2,3 
1,2,4 
1,3,4 
1,2,3,4
2 
2,3 
2,4 
3 
3,4
4

and these should not:

q dawda 323123 a3 a1 1aa,1234 4321 a4,32,1a 1112222334411 
1,,2,33,444, 11,12,a  234 2,2,3 33 3,3,3 3,34 34 123 1,4,4,4a 1,444 

The best try I currently have is:

\b[1-4][\,]?[2-4]?[\,]?[3-4]?[\,]?[4]?\b

This still has two major drawbacks:

  1. It delivers quite a lot of false positives. Numbers are not eliminated after they occurred once.
  2. It will get rather long, when the range of numbers increases, e.g. 1-18 is already possible as well, bigger ranges are thinkable of.

I used regexpal for testing purposes.

Side notes:

  • As I'm using sql it would be possible to implement some algorithm in another language to generate all the possible combinations and save them in a table that can be used for joining, see e.g. How to get all possible combinations of a list’s elements?. I would like to only rely on that as a last resort, as the creation of new tables will be involved and these will contain a lot of entries.
  • The resulting sql statement that uses the regex should run on both Postgres and Oracle.
  • The set of positive examples is also referred to as "powerset".

Edit: Clarified the list of positive examples

Thomas S.
  • 3
  • 3
  • 3
    "*should run on both Postgres and Oracle*" - this is really a showstopper. The regex support in both DBMS is very different and I doubt you can really come up with one single statement that runs without any change in both products. –  Mar 11 '19 at 09:53
  • If there is not one statement running on both Oracle and Postgres, I'm still fine with two separate statements that would work. – Thomas S. Mar 11 '19 at 10:06
  • i think you should tag a specific DBMS then you can get much better help. Like @a_horse_with_no_name also mentioned before more or less ... *"How to get all combinations (ordered sampling without replacement) in regex"* Basically you are going to need to CROSS JOIN with two the same regex filters to get a possible combinations something like ... `SELECT * FROM ( /* regex query */ ) as regex1 CROSS JOIN ( /* regex query */ ) as regex2` – Raymond Nijland Mar 11 '19 at 10:15

1 Answers1

2

I wouldn't use Regex for this, as e.g. the requirements "have to be unique" and "have to be in ascending order" can't really be expressed with a regular expression (at least I can't think of a way to do that).

As you also need to have an expression that is identical in Postgres and Oracle, I would create a function that checks such a list and then hide the DBMS specific implementation in that function.

For Postgres I would use its array handling features to implement that function:

create or replace function is_valid(p_input text)
  returns boolean
as
$$
  select coalesce(array_agg(x order by x) = string_to_array(p_input, ','), false)
  from (
    select distinct x
    from unnest(string_to_array(p_input,',')) as t(x)
    where x ~ '^[0-9]+$' -- only numbers
  ) t
  where x::int between 1 and 4 -- the cast is safe as the inner query only returns valid numbers
$$
language sql;

The inner query returns all (distinct) elements from the input list as individual numbers. The outer query then aggregates that back for values in the desired range and numeric order. If that result isn't the same as the input, the input isn't valid.

Then with the following sample data:

with sample_data (input) as (
  values 
    ('1'),
    ('1,2'),
    ('1,3'), 
    ('1,4'), 
    ('1,2,3'), 
    ('1,2,4'),
    ('foo'),
    ('1aa,1234'),
    ('1,,2,33,444,')
)
select input, is_valid(input)
from sample_data;

It will return:

input        | is_valid
-------------+---------
1            | true    
1,2          | true    
1,3          | true    
1,4          | true    
1,2,3        | true    
1,2,4        | true    
foo          | false   
1aa,1234     | false   
1,,2,33,444, | false   

If you want to use the same function in Postgres and Oracle you probably need to use returns integer in Postgres as Oracle still doesn't support a boolean data type in SQL


Oracle's string processing functions are less powerful than Postgres' functions (e.g. no string_to_array or unnest), but you can probably implement a similar logic in PL/SQL as well (albeit more complicated)

  • *"Oracle's string processing functions are less powerful than Postgres' functions (e.g. no string_to_array or unnest)"* Well Oracle 11G+ seams to have [dbms_utility.comma_to_table](https://docs.oracle.com/cd/B28359_01/appdev.111/b28419/d_util.htm#i1002468) which does the same more or less then the inner query in your answer. – Raymond Nijland Mar 11 '19 at 10:25
  • 1
    @RaymondNijland: hmm, that doesn't seem to do the same thing. "*These procedures converts a comma-delimited list of names into a PL/SQL table of names*" - the input list doesn't contain "names" it contains "values". The "split CSV string to rows" seems to be much more complicated in Oracle. See e.g. [here](https://stackoverflow.com/questions/38371989) or [here](https://stackoverflow.com/questions/14328621) –  Mar 11 '19 at 10:35
  • indeed i guess i misunderstood the Oracle documentation... i also found this [one](https://stackoverflow.com/questions/26273042/getting-comma-separated-list-near-xx-yy-invalid-with-dbms-utility-comma-to-t/26273192) – Raymond Nijland Mar 11 '19 at 10:45