1

For my database, I have a list of company numbers where some of them start with two letters. I have created a regex which should eliminate these from a query and according to my tests, it should. But when executed, the result still contains the numbers with letters.

Here is my regex, which I've tested on https://www.regexpal.com

([^A-Z+|a-z+].*)

I've tested it against numerous variations such as SC08093, ZC000191 and NI232312 which shouldn't match and don't in the tests, which is fine.

My sql query looks like;

SELECT companyNumber FROM company_data 
WHERE companyNumber ~ '([^A-Z+|a-z+].*)' order by companyNumber desc

To summerise, strings like SC08093 should not match as they start with letters.

I've read through the documentation for postgres but I couldn't seem to find anything regarding this. I'm not sure what I'm missing here. Thanks.

Kieran Dee
  • 129
  • 9

3 Answers3

4

The ~ '([^A-Z+|a-z+].*)' does not work because this is a [^A-Z+|a-z+].* regex matching operation that returns true even upon a partial match (regex matching operation does not require full string match, and thus the pattern can match anywhere in the string). [^A-Z+|a-z+].* matches a letter from A to Z, +,|or a letter fromatoz`, and then any amount of any zero or more chars, anywhere inside a string.

You may use

WHERE companyNumber NOT SIMILAR TO '[A-Za-z]{2}%'

See the online demo

Here, NOT SIMILAR TO returns the inverse result of the SIMILAR TO operation. This SIMILAR TO operator accepts patterns that are almost regex patterns, but are also like regular wildcard patterns. NOT SIMILAR TO '[A-Za-z]{2}%' means all records that start with two ASCII letters ([A-Za-z]{2}) and having anything after (%) are NOT returned and all others will be returned. Note that SIMILAR TO requires a full string match, same as LIKE.

Wiktor Stribiżew
  • 607,720
  • 39
  • 448
  • 563
  • Thank you, that worked. If you can, could you perhaps explain why that did work and my regex didn't perhaps? – Kieran Dee Jan 21 '19 at 11:52
  • @KieranDee I hope I added enough details. I think `SIMILAR TO` is the best fit for your task. If your conditions become more specific, you may consider moving to the `~` operator. Then, do not forget about `^` to mark the start of string and in case you need that, `$` as the end of string anchor. – Wiktor Stribiżew Jan 21 '19 at 11:58
  • Thanks for your explanations. So even if any part of the string matches the regex, postgres will return it basically. Meaning I have to match the entire string with the regex rather than part of it, at least for my regex. – Kieran Dee Jan 21 '19 at 12:02
2

Not start with a letter could be done with

WHERE company ~ '^[^A-Za-z].*'

demo: db<>fiddle

The first ^ marks the beginning. The [^A-Za-z] says "no letter" (including small and capital letters).


Edit: Changed [A-z] into the more precise [A-Za-z] (Why is this regex allowing a caret?)

S-Man
  • 22,521
  • 7
  • 40
  • 63
  • `[^A-z]` does not actually mean no letter. It means ``no letter, [, \, ], ^, _ and ` chars``. See [the `[A-z]` char range](https://stackoverflow.com/a/29771926/3832970). – Wiktor Stribiżew Jan 21 '19 at 12:06
  • @WiktorStribiżew yes of course you're right. In that case the other solution is wrong as well because it starts always with a digit. Thats when the TO no exactly defines its use case. – S-Man Jan 21 '19 at 12:40
2

Your pattern: [^A-Z+|a-z+].* means "a string where at least some characters are not A-Z" - to extend that to the whole string you would need to use an anchored regex as shown by S-Man (the group defined with (..) isn't really necessary btw)

I would probably use a regex that specifies want the valid pattern is and then use !~ instead.

where company !~ '^[0-9].*$'

^[0-9].*$ means "only consists of numbers" and the !~ means "does not match"

or

where not (company ~ '^[0-9].*$')