3

I am trying to isolate an email address from a block of free field text (column name is TEXT).

There are many different variations of preceding and succeeding characters in the free text field, i.e.:

email me! john@smith.com
e:john@smith.com m:555-555-5555
john@smith.com--personal email

I've tried variations of INSTR() and SUBSTRING_INDEX() to first isolate the "@" (probably the one reliable constant in finding an email...) and extracting the characters to the left (up until a space or non-qualifying character like "-" or ":") and doing the same thing with the text following the @.

However - everything I've tried so far hasn't filtered out the noise to the level I need.

Obviously 100% accuracy isn't possible but would someone mind taking a crack at how I can structure my select statement?

Kainix
  • 1,186
  • 3
  • 21
  • 33

2 Answers2

2

There is no easy solution to do this within MySQL. However you can do this easily after you have retrieved it using regular expressions.

Here would be a an example of how to use it in your case: Regex example

If you want it to select all e-mail addresses from one string: Regex Example

You can use regex to extract the ones where it does contain an e-mail in MySQL but it still doesn't extract the group from the string. This has to be done outside MySQL

SELECT * FROM table
WHERE column RLIKE '\w*@\w*.\w*'

RLIKE is only for matching it, you can use REGEXP in the SELECT but it only returns 1 or 0 on whether it has found a match or not :s

If you do want to extract it in MySQL maybe this other stackoverflow post helps you out. But it seems like a lot of work instead of doing it outside MySQL

Community
  • 1
  • 1
Jester
  • 1,408
  • 1
  • 9
  • 21
  • Regex validation of emails is a nightmare in itself https://stackoverflow.com/questions/201323/using-a-regular-expression-to-validate-an-email-address – Elzo Valugi Apr 19 '16 at 21:58
  • @ElzoValugi oh for sure, i made a very basic regex, as an email could contain numbers and certain characters as well. Thomas said in his question that he's not aiming for 100 % accuracy. Probably better to ensure clean input from the start instead of trying to extract e-mailaddresses out of random text which is for sure impossible to get 100 % accuracy on – Jester Apr 19 '16 at 22:03
1

Now in MySQL 5 and 8 you can use REGEXP_SUBSTR to isolate just the email from a block of free text.

SELECT *, REGEXP_SUBSTR(`TEXT`, '([a-zA-Z0-9._%+\-]+)@([a-zA-Z0-9.-]+)\.([a-zA-Z]{2,4})') AS Emails FROM `mytable`;

If you want to get just the records with emails and remove duplicates ...

SELECT DISTINCT REGEXP_SUBSTR(`TEXT`, '([a-zA-Z0-9._%+\-]+)@([a-zA-Z0-9.-]+)\.([a-zA-Z]{2,4})') AS Emails FROM `mytable` WHERE `TEXT` REGEXP '([a-zA-Z0-9._%+\-]+)@([a-zA-Z0-9.-]+)\.([a-zA-Z]{2,4})';
Saj
  • 781
  • 8
  • 7