5

I am trying to validate email addresses in my google bigquery table. I am using following query to do that but bigquery is throwing an error. I am not sure what is wrong in my code. I found the regex for validation on below story:- What characters are allowed in an email address?

select email
FROM my table
WHERE REGEXP_CONTAINS(email, regex)

here regex is the

Perl's RFC2822 regex

Mikhail Berlyant
  • 165,386
  • 8
  • 154
  • 230
KIRIT BHATT
  • 71
  • 1
  • 3
  • 1
    Can you add the error message? it may due to the REGEXP expression or how your specify it. Notice `regex` should be quoted and with a character (`r`) before the quotes. Check [this example](https://cloud.google.com/bigquery/docs/reference/standard-sql/functions-and-operators#regexp_contains) about REGEXP in Standard SQL – enle lin Jan 25 '19 at 14:58

2 Answers2

5

Try this query

SELECT email
FROM table
where email REGEXP '[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,6}'

Note that the last part {2,6} will limit the top domain name to only 6 characthers long, as pointed out by @smoore4.

SELECT email
FROM table
where email REGEXP '[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}'

This will only demand a minimum length of 2 from the top domain.

Mad Dog Tannen
  • 7,129
  • 5
  • 31
  • 55
  • 1
    `` is really short for TLDs, see TLD list: https://www.iana.org/domains/root/db ; valid/invalid addresses: https://en.wikipedia.org/wiki/Email_address#Examples ; regex for RFC822 email address: http://www.ex-parrot.com/~pdw/Mail-RFC822-Address.html – Toto Jan 25 '19 at 15:12
  • 2
    Is this still working in BigQuery? This works in BigQuery `REGEXP_CONTAINS(email, r'^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,6}$')`. – p13rr0m Aug 12 '22 at 08:17
  • you should create your comment as an answer p13rr0m, imho – smoore4 Jan 17 '23 at 22:18
  • 1
    the last part about {2,6} is mostly ok but it will leave off new valid domains such as .shopping, .digital, .computer, etc – smoore4 Jan 17 '23 at 22:27
0

I think this is cleaner...

SELECT email 
FROM table 
where REGEXP_CONTAINS(email, r"@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+") 
  • 1
    This is not correct. For example, email@email.com@ and @email.com are not valid email addresses but would work according to this regex. – p13rr0m Aug 12 '22 at 08:10