Let's simplify the query in your question to the component parts you're asking about. Once we see how they work individually, perhaps the whole query will make more sense.
To start, the ~
(tilde) is the POSIX, case-sensitive regular expression operator. The linked PostgreSQL documentation provides brief descriptions and usage examples of it and its sibling operators:
Operator |
Description |
Example |
~ |
Matches regular expression, case sensitive |
'thomas' ~ '.*thomas.*' |
~* |
Matches regular expression, case insensitive |
'thomas' ~* '.*Thomas.*' |
!~ |
Does not match regular expression, case sensitive |
'thomas' !~ '.*Thomas.*' |
!~* |
Does not match regular expression, case insensitive |
'thomas' !~* '.*vadim.*' |
We can see that each operator has two operands: a constant string on the left, and a pattern on the right. If the string on the left is a match for the pattern on the right, the statement is true
, otherwise it is false
.
In the given example for the operator you're asking about, 'thomas'
is a match for the pattern '.*thomas.*'
by standard regular expression rules. The '.*'
pre-and-postfixes mean "match any character (except newline) any number of times (zero or more)". The whole pattern then means, "match any character any number of times, then the literal string 'thomas', then any character any number of times". One such match would be 'john thomas jones'
where 'john '
matches the first '.*'
and ' jones'
matches the second '.*'
.
I don't think this is a great example because it is functionally equivalent to 'thomas' LIKE '%thomas%'
which is likely to run faster, among other benefits like being a SQL-standard operator.
A better example is the query in your question where the pattern '^[0-9]'
is used. Setting aside the ^
for now, this pattern means, "match any character in 0-9
(0, 1, 2, ..., 8, 9)", which would be much more verbose if you were to use the LIKE
operator: field LIKE '^0' OR field LIKE '^1' OR field LIKE '^2' ...
.
The ^
operator is not PostgreSQL-specific. Rather it is a special character in regular expressions with one of two meanings (aside from its use as a literal character; more about that in this answer):
- The match should begin at the start of the line/string.
For example, the string "Hello, World!" would contain a match for the pattern 'World'
since the word "World" appears in it, but would not contain a match for the pattern '^World'
since the word "World" is not at the start of the string.
The string "Hello, World!" would contain a match for both of the following patterns: 'Hello'
and '^Hello'
since the word "Hello" is at the start of the string.
- The given character set should be negated when making a match.
For example, the pattern [^0-9]
means, "match any character that is not in the range 0-9". So 'a'
would match, '&'
would match, and 'G'
would match, but '7'
would not match since it is in the character set that is being excluded.
The query in your question uses the first of the two meanings. The pattern '^[0-9]'
means, "match any character in the range 0-9 starting at the beginning of the string". So '0123'
would match since the string starts with "0", but 'a5'
would not match since the string starts with "a" which is not the character set that is being matched.
Back to the query in your question, then. The relevant part reads:
1 count(distinct
2 case
3 when left(business_address, 1) ~ '^[0-9]'
4 then lower(split_part(business_address, ' ', 2))
5 else lower(split_part(business_address, ' ', 1))
6 end
7 ) as n_street
Line 3 contains a regular expression match that will determine if we should use this case in the overall CASE
statement. If the string matches the pattern, the expression will be true
and we will use this case. If the string does not match the pattern, the expression will be false
and we will try the next case.
The string we are matching to the pattern is left(business_address, 1)
. The LEFT
function takes the first n
characters from the string. Since n
is "1" here, this returns the first character of the field business_address
.
The pattern we are trying to match this string to is '^[0-9]'
which we have already said means, "match any character in the range 0-9 starting at the beginning of the string". Technically we don't need the ^
regex operator here since LEFT(..., 1)
will return at most one character (which will always be the first character in the resulting string).
As an example, if business_address
is "123 Jones Street, Anytown, USA", then LEFT(business_address, 1)
will return "1" which will match the pattern (and therefore the expression will be true
and we will use the first case).
If, instead, business_address
were "Jones Plaza, Suite 123, Anytown, USA", then LEFT(business_address, 1)
would return "J" which would not match the pattern (since the first character is "J" which is not in the range 0-9). Our expression would be false
and we would continue to the next case.