
I have texts that can contain one or more email addresses, and I use a regex to match them. First I used (from this previous question):

[A-Za-z0-9_-]+@[A-Za-z0-9_-]+\.([A-Za-z0-9_-][A-Za-z0-9_]+)

This caused two problems. It failed when a . appeared before the @, and it also failed when an email address ended in two or more domain extensions (for example ...@domain.co.uk). So I changed the expression to

^([a-z0-9_\.-]+)@([\da-z\.-]+)\.([a-z\.]{2,6})

This solves both of the original problems, but creates a new one. If an email address in the text comes right before a full stop, the full stop is now included in the address! So this text gives me problems:

Please email us at: some@example.com. You can also mail us at some@example.co.uk. Etc...

Is there a way to exclude this last . if it is followed by either a blank space or a line break?

P.S. I do not need to validate email addresses. I only need my expression to find where each email address (one or several) in a text starts and where it stops.

Dirk J. Faber
  • You could for example change your regex to `([a-z0-9_.-]+)@((?:[\da-z.-]+)\.)+([a-z]{2,6})` [demo](https://regex101.com/r/tAUs26/1). That will repeat the part after the @ sign including the first dot 1+ times. Then omit the dot in the last part. – The fourth bird Nov 22 '18 at 21:06
  • [Here](https://regex101.com/r/i1EfNo/2) is another one that seems to fit your needs: `[A-Za-z0-9_-]+@[A-Za-z0-9_-]+\.([A-Za-z][A-Za-z]+)+(.[A-Za-z]{2,})?` This takes the last dot only if it's followed by 2 or more a-zA-Z. (And I've taken out `_-` from the top-level domains, as I don't think they would be valid there.) – Jeff Nov 22 '18 at 21:09
  • Add a `\b` at the end. `[a-z.]{2,6}\b`. Replace `a-z` with `\p{L}` or `A-Za-z0-9_` with `\w` and add `/u` modifier to match all Unicode letters. Do not use `^` at the start. See this [**regex demo**](https://regex101.com/r/g7QR27/1). – Wiktor Stribiżew Nov 22 '18 at 21:14
  • @Jeff - That regex doesn't allow emails with foreign characters, dashes or numbers, like: `hello@åä-ö.com`, while `åä-ö.com` actually is a valid domain. You should also be able to have dots in the name part. When matching email addresses (and URLs), you shouldn't be too strict. – M. Eriksson Nov 22 '18 at 21:15
  • @MagnusEriksson true. I only edited the last part of the shown regex so didn't take care of "foreign" (like my language) characters. Dashes in top-level domains aren't valid though, right? I'm deleting my comment, it wasn't too good anyway... – Jeff Nov 22 '18 at 21:20
  • @Jeff - No, not in TLDs (`.com`, `.se`, `.org`), but they are valid in domain names: `some-example.com`. – M. Eriksson Nov 22 '18 at 21:21
  • I had them in there: `..@[A-Za-z0-9_-]..` - but of course I forgot about subdomains like `spam@sub-domain.my-host.com` – Jeff Nov 22 '18 at 21:23
  • @Jeff, I think Fourth Bird's solution is better because this also includes addresses like `example@so.il.uk` – Dirk J. Faber Nov 22 '18 at 21:23
  • @Jeff - Sure, but you're not accounting for sub domains and sub-sub domains etc. – M. Eriksson Nov 22 '18 at 21:25
  • @DirkJ.Faber absolutely right. Maybe take Magnus' comments about mine (for `ä`) into that as well. – Jeff Nov 22 '18 at 21:25
  • Basically, look for a good ready-made regex for this online. Trying to do it yourself is usually _really_ painful. If you check regexes that take most rules into account, they are _huge_... – M. Eriksson Nov 22 '18 at 21:26
  • @DirkJ.Faber [Mine matches `example@so.il.uk`, too](https://regex101.com/r/g7QR27/3) – Wiktor Stribiżew Nov 22 '18 at 21:27
  • @WiktorStribiżew, beautiful expression! – Dirk J. Faber Nov 22 '18 at 21:48
  • Does it work for all of your cases? – Wiktor Stribiżew Nov 22 '18 at 21:54
  • @WiktorStribiżew, not entirely yet because for some reason my php app won't recognize foreign characters with this expression, even though it 100% should. – Dirk J. Faber Nov 22 '18 at 21:58
  • See https://ideone.com/ThKBw5 – Wiktor Stribiżew Nov 22 '18 at 22:00
  • @WiktorStribiżew, perfect, thank you! – Dirk J. Faber Nov 22 '18 at 22:03

1 Answer


You may use

/[\p{L}0-9_.-]+@[0-9\p{L}.-]+\.[a-z.]{2,6}\b/u

See the regex demo. Or, to only start matching from a letter or digit:

/[\p{L}0-9][\p{L}0-9_.-]*@[0-9\p{L}.-]+\.[a-z.]{2,6}\b/u

\p{L} matches any Unicode base letter (add \p{M} if you also need to match diacritics, though I doubt there are any here). The word boundary \b at the end makes the match stop before a trailing dot. Also remove any groupings that you are not using.
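To make the effect of the trailing \b concrete, here is a minimal sketch that runs the same pattern with and without it on the sample sentence from the question. The variable names are only illustrative.

```php
<?php
// The same pattern twice: once without and once with the trailing \b.
$withoutBoundary = '/[\p{L}0-9_.-]+@[0-9\p{L}.-]+\.[a-z.]{2,6}/u';
$withBoundary    = '/[\p{L}0-9_.-]+@[0-9\p{L}.-]+\.[a-z.]{2,6}\b/u';

// Sample sentence taken from the question.
$text = 'Please email us at: some@example.com. You can also mail us at some@example.co.uk. Etc...';

preg_match_all($withoutBoundary, $text, $loose);
preg_match_all($withBoundary, $text, $strict);

// Without \b, [a-z.]{2,6} is free to swallow the sentence-ending dot.
print_r($loose[0]);  // some@example.com.  and  some@example.co.uk.

// With \b, the match has to end right after a word character,
// so the engine backtracks off the final dot.
print_r($strict[0]); // some@example.com   and  some@example.co.uk
```

The boundary works here because a dot followed by a space (or by the end of the text) is never a word boundary, while the position between the last letter and the dot always is.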

See the PHP demo:

$re = '/[\p{L}0-9_.-]+@[0-9\p{L}.-]+\.[a-z.]{2,6}\b/u';
$str = 'Please email us at: some@example.com. You can also mail us at some@example.co.uk. Etc... hello@åä-ö.com
example@so.il.uk';
if (preg_match_all($re, $str, $matches)) {
  print_r($matches[0]);
}

Output:

Array
(
    [0] => some@example.com
    [1] => some@example.co.uk
    [2] => hello@åä-ö.com
    [3] => example@so.il.uk
)
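The second pattern above, which only starts a match at a letter or digit, is not covered by the demo. As a rough sketch, assuming a made-up input where the addresses are preceded by stray punctuation (the text and variable names are hypothetical), the two patterns differ like this:

```php
<?php
// First pattern: the local part may start with any allowed character.
$anyStart   = '/[\p{L}0-9_.-]+@[0-9\p{L}.-]+\.[a-z.]{2,6}\b/u';
// Second pattern: the local part must start with a letter or digit.
$alnumStart = '/[\p{L}0-9][\p{L}0-9_.-]*@[0-9\p{L}.-]+\.[a-z.]{2,6}\b/u';

// Hypothetical input with punctuation glued to the front of each address.
$text = 'Contact: -john.doe@example.com, or _info@example.org.';

preg_match_all($anyStart, $text, $a);
preg_match_all($alnumStart, $text, $b);

print_r($a[0]); // -john.doe@example.com  and  _info@example.org
print_r($b[0]); // john.doe@example.com   and  info@example.org
```

Which variant fits better depends on whether a leading `-`, `_` or `.` should count as part of the address in your texts.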
Wiktor Stribiżew