PostgreSQL: .csv regex - test for repeating substrings within a string (digits)

Question

Introduction:

I have the following scenario in PostgreSQL whereby I want to perform some data validation on a .csv string prior to inserting it into a table (see the fiddle here).

I've managed to get a regex (in a CHECK constraint) which disallows spaces within strings (e.g. "12 34") and also disallows preceding zeros ("00343").

Now, the icing on the cake would be if I could use regular expressions to disallow strings which contain a repeat of an integer - i.e. if a sequence \d+ matched another \d+ within the same string.

Is this beyond the capacities of regular expressions?

My table is as follows:

CREATE TABLE test
(
  data TEXT NOT NULL,
  
  CONSTRAINT d_csv_only_ck 
    CHECK (data ~ '^([ ]*([1-9]\d*)+[ ]*)(,[ ]*([1-9]\d*)+[ ]*)*$')

);

And I can populate it as follows:

INSERT INTO test VALUES 
('992,1005,1007,992,456,456,1008'),  -- want to make this line unnacceptable - repeats!
('44,1005,1110'), 
('13,  44  ,  1005,  10078  '),  -- acceptable - spaces before and after integers   
('11,1203,6666'),
('1,11,99,2222'),
('3435'),             
('  1234    '); -- acceptable

But:

INSERT INTO test VALUES ('23432, 3433   ,00343, 567'); -- leading 0 - unnacceptable

fails (as it should), and also fails (again, as it should)

INSERT INTO test VALUES ('12  34');  -- spaces within numbers - unnacceptable

The question:

However, if you notice the first string, it has repeats of 992and 456.

I would like to be able to match these.

All of these rules do not have to be in the same regex - I can use a second CHECK constraint.

I would like to know if what I am asking is possible using Regular Expressions?

I did find this post which appears to go some (all?) of the way to solving my issue, but I'm afraid it's beyond my skillset to get it to work - I've included a small test at the bottom of the fiddle.

Please let me know should you require any further information.

p.s. as an aside, I'm not very experienced with regexes and I would welcome any input on my basic one above.

I think I understand what you are trying to do but I wonder why? Why are adding the validation at the database layer? Could you possibly have it in the code that manages the database? What kind of data is it? — Omeri, Aug 31 '21 at 10:30
@ErionOmeri - yes, it's a matching problem, but within the input string and not simply matching a given string literal that's already known with the input! — Vérace, Aug 31 '21 at 10:32
You might need to use regex “Look Ahead” to see if the string repeats. I am not sure if PostreSQL supports that in its implementation. You could create a function to make things a lot easier, but I would do this code, if at all. — Omeri, Aug 31 '21 at 10:38
PostgreSQL does have [very sophisticated](https://www.postgresql.org/docs/13/functions-matching.html) regex capabilites! Faleminderit for your input! — Vérace, Aug 31 '21 at 10:40

Wiktor Stribiżew · Accepted Answer · 2021-08-31T11:33:03.487

2

Since PostegreSQL regex does not support backreferences, you cannot apply this restriction because you would need a negative lookahead with a backreference in it.

Have a look at this PCRE regex:

^(?!.*\b(\d+)\b.*\b\1\b) *[1-9]\d* *(?:, *[1-9]\d* *)*$

See this regex demo. Details:

^ - start of string
(?!.*\b(\d+)\b.*\b\1\b) - no same two numbers as whole word allowed anywhere in the string
* - zero or more spaces
[1-9]\d* - a non-zero digit and then any zero or more digits
* - zero or more spaces
(?:, *[1-9]\d* *)* - zero or more occurrences of
- , * - comma and zero or more spaces
- [1-9]\d* - a non-zero digit and then any zero or more digits
- * - zero or more spaces
$ - end of string.

Even if you replace \b with \y (PostgreSQL regex word boundaries) in the PostgreSQL code, it won't work due to the drawback mentioned at the top of the answer.

edited Aug 31 '21 at 11:33

answered Aug 31 '21 at 10:25

Wiktor Stribiżew

607,720
39
448
563

1

Thanks for your input! I already upvoted your contribution which I reference in my question, so I hope you won't mind a direct question! If you look [here](https://regex101.com/r/fu3K60/1), your regex does what I want with INTs, it seems to "get confused" with alphabetical strings for some reason? Could you also provide a brief breakdown of what the bits do? Ideally, I'd like to know "how to fish..."! – Vérace Aug 31 '21 at 10:30
@Vérace I do not quite understand what confusion you noticed here. What do you mean? – Wiktor Stribiżew Aug 31 '21 at 10:40
456, 897, 456 - is in white - a match on 456? asfdasdf adsfas ,adfas also in white - no match there? – Vérace Aug 31 '21 at 10:41
@Vérace `456, 897, 456` should not match, there are two repeating numbers. `asfdasdf adsfas ,adfas` has no numbers and should not match. – Wiktor Stribiżew Aug 31 '21 at 10:45
Ah... OK, I was "inverting" them - if I want to reverse the "polarity" - i.e. make the white ones coloured and the coloured ones white - what do I have to do? (final question - then I'll upvote and mark as correct - and dziękuję). Trying to understand!... :-) – Vérace Aug 31 '21 at 10:54
@Vérace A very crude way to reverse a pattern that matches the whole string is to use `^(?!pattern).*` / `^(?!pattern)`. So, in your case, `^(?!(?!.*\b(\d+)\b.*\b\1\b) *[1-9]\d* *(?:, *[1-9]\d* *)*$).*` (see [demo](https://regex101.com/r/fu3K60/2)). – Wiktor Stribiżew Aug 31 '21 at 10:59
I upvoted your answer as helpful - and it does answer the question. Would you be able to look [here](https://dbfiddle.uk/?rdbms=postgres_13&fiddle=4253e80e03ed2151ed774495687ba2d4) and tell me if it's a PostgreSQL problem or is there something I'm missing? If so, I'll mark your input as correct... sorry about the messing around. I'll mark it correct anyway - it's puzzling me because normally PostgreSQL is **very** standards compliant and I'm wondering if it's a PostgreSQL problem? Thanks for any input... – Vérace Aug 31 '21 at 11:22
@Vérace Sorry, I forgot that PostegreSQL regex [does not support backreferences](https://stackoverflow.com/questions/22751376/postgresql-invalid-regular-expression-invalid-backreference-number?lq=1). I updated the answer. – Wiktor Stribiżew Aug 31 '21 at 11:30
Ah, well, thanks for your input anyway! I can do it in various other ways - would have been nice to do it in DDL - I'm a big believer in putting as much as possible into the DDL code as soon as possible - your db is your last line of defence for your data and the stronger you can make it, the better, but I did learn a bit! Jeszcze raz dzięki... – Vérace Aug 31 '21 at 12:06

PostgreSQL: .csv regex - test for repeating substrings within a string (digits)

Introduction:

The question:

1 Answers1

Linked