0

I am working with an programm which uploads emailadresses to another programm - but it accepts emails only in one way:

i tried to write a reglular expression to filter out emailadresse which are not accepted

^(?:([A-Za-z0-9!#$%*+-.=?~|`_^]{1,64})|(\"[A-Za-z0-9!#$%*+-.=?~|`_^(){}<>@,;: \[\]]{1,64}\"))\@(?!\.)(?!\-)(?!.*\.$)(?!.*\.\.)([A-Za-z0-9.-]{1,61})\.([a-z]{2,10})$

The description says: username@domain

The at sign ('@') must be present and not first or last character. The length of the name can have up to and including 64 characters. The length of the domain can have up to and including 64 characters. All email addresses are forced to lowercase when the email is sent. Therefore any email addresses requiring uppercase will most likely not be delivered correctly by the ISP as we will have changed it to lowercase. username

Can contain:

  • A-Z
  • a-z
  • 0-9
  • ! # $ % * + - . = ? ~ | ` _ ^

The entire name can be surrounded by double quotes (though this is not supported by many ISPs). In this case, the following additional characters are allowed between the quotes - ( ) { } < > @ , ; : [ ] (space)

domain

Can contain:

  • A-Z
  • a-z
  • 0-9

Cannot contain 2 or more consecutive periods Must contain at least 1 period Domain - Cannot begin or end with a period or dash

also the part with [] does not work

Thanks for your help.

Tobias G.
  • 7
  • 3
  • 1
    You cannot convert the PCRE regex above into a POSIX-like Oracle regex, you cannot use lookaheads there. – Wiktor Stribiżew Aug 30 '22 at 14:04
  • 1
    [Oracle regular expression support](https://docs.oracle.com/en/database/oracle/oracle-database/19/sqlrf/Oracle-Regular-Expression-Support.html) doesn't include lookaheads. This seems like something you should be doing in another layer - as the address is entered into your application? If it has to be done in the DB, you could maybe split the value up and parse with various logic applied in a PL/SQL function; or perhaps more simply look at a [Java stored procedure](https://docs.oracle.com/en/database/oracle/oracle-database/19/jjdev/developing-Java-stored-procedures.html) to handle the regex? – Alex Poole Aug 30 '22 at 14:05
  • by the way - i need to get rid of "äüöß" as well – Tobias G. Aug 30 '22 at 14:20
  • @AlexPoole - yes i know that i might implement this into the input layer - but i am not able to - my only possibility is to filter them out – Tobias G. Aug 30 '22 at 14:23
  • For some fun facts on regex and email addresses see https://stackoverflow.com/q/201323/4178262 – Mark Stewart Aug 30 '22 at 14:30

2 Answers2

1

Oracle does not, natively, support non-capturing groups, look-ahead or look-behind in regular expressions.


However, if you have Java enabled in the database then you can compile a Java class:

CREATE AND COMPILE JAVA SOURCE NAMED RegexParser AS
import java.util.regex.Pattern;

public class RegexpMatch {
  public static int match(
    final String value,
    final String regex
  ){
    final Pattern pattern = Pattern.compile(regex);

    return pattern.matcher(value).matches() ? 1 : 0;
  }
}
/

And create a PL/SQL wrapper function:

CREATE FUNCTION regexp_java_match(value IN VARCHAR2, regex IN VARCHAR2) RETURN NUMBER
AS LANGUAGE JAVA NAME 'RegexpMatch.match( java.lang.String, java.lang.String ) return int';
/

and then you can use your regular expression (or any other regular expression that Java supports):

SELECT REGEXP_JAVA_MATCH(
         'alice@example.com',
         '^(?:([A-Za-z0-9!#$%*+-.=?~|`_^]{1,64})|(\"[A-Za-z0-9!#$%*+-.=?~|`_^(){}<>@,;: \[\]]{1,64}\"))\@(?!\.)(?!\-)(?!.*\.$)(?!.*\.\.)([A-Za-z0-9.-]{1,61})\.([a-z]{2,10})$'
       ) AS match
FROM   DUAL

Which outputs:

MATCH
1

db<>fiddle here

MT0
  • 143,790
  • 11
  • 59
  • 117
0

Your regular expression can be re-written into a format that Oracle supports as:

  • (?:) non-capturing group are not supported and should just be a () capturing group instead.
  • Look-ahead is not supported but you can rewrite the look-ahead patterns using character list so @(?!\.)(?!-)([A-Za-z0-9.-]{1,61}) can be rewritten as @[A-Za-z0-9][A-Za-z0-9.-]{0,60}.
  • The (?!.*\.$) look-ahead is redundant as the pattern ends with ([a-z]{2,10})$ and can never match a trailing ..
  • If you want to include ] and - in a character list then ] should be the first character and - the last in the set.
  • The only thing that cannot be implemented in an Oracle regular expression is simultaneously restricting the length of the post-@ segment and ensuring there are no .. double dots; to do that you need to check for one of those two conditions in a second regular expression.
SELECT REGEXP_SUBSTR(
         REGEXP_SUBSTR(
           'alice@example.com',
           '^('
             -- Unquoted local-part
             || '[A-Za-z0-9!#$%*+.=?~|`_^-]{1,64}'
           -- or
           || '|'
             -- Quoted local-part
             || '"[]A-Za-z0-9!#$%*+.=?~|`_^(){}<>@,;: [-]{1,64}"'
           || ')@'
           -- Domains
           || '[A-Za-z0-9]([A-Za-z0-9.-]{0,60})?'
           -- Top-level domain
           || '\.[a-z]{2,10}$'
         ),
         -- Local-part
         '^([^"]*?|".*?")'
         || '@'
         -- Domains - exclude .. patterns
         || '([^.]+\.)+[a-z]{2,10}$'
       ) AS match
FROM   DUAL

Or, using POSIX character lists:

SELECT REGEXP_SUBSTR(
         REGEXP_SUBSTR(
           'alice@example.com',
           '^('
             -- Unquoted local-part
             || '[[:alnum:]!#$%*+.=?~|`_^-]{1,64}'
           -- or
           || '|'
             -- Quoted local-part
             || '"[][:alnum:]!#$%*+.=?~|`_^(){}<>@,;: [-]{1,64}"'
           || ')@'
           -- Domains
           || '[[:alnum:]]([[:alnum:].-]{0,60})?'
           -- Top-level domain
           || '\.[[:lower:]]{2,10}$'
         ),
         -- Local-part
         '^([^"]*?|".*?")'
         || '@'
         -- Domains
         || '([^.]+\.)+[[:lower:]]{2,10}$'
       ) AS match
FROM   DUAL

Which both output:

MATCH
alice@example.com

db<>fiddle here

MT0
  • 143,790
  • 11
  • 59
  • 117
  • this works super fine - thanks but there is always a but in regex oracle the letters A-Z are containing diacritics - how to exclude them too? – Tobias G. Aug 31 '22 at 12:01
  • using ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz instead makes the query 3 times longer to execute – Tobias G. Aug 31 '22 at 12:10
  • @TobiasG. What are your `NLS_COMP` and `NLS_SORT` session parameters set to? (See https://stackoverflow.com/a/64352981/1509264) – MT0 Aug 31 '22 at 13:09
  • NLS_SORT GERMAN NLS_COMP BINARY – Tobias G. Aug 31 '22 at 15:18