I'm uploading tab-delimited text files exported from Excel and extracting every email address in the document with preg_match_all:

preg_match_all("/([\s]*)[\._a-zA-Z0-9-]+@[\._a-zA-Z0-9-]+/i",$string,$emails);

In some cases an email was saved with a URL attached to it, like this: prefix.user@domain.comwww.domain.com

I need to extract only the email address, without the URL that follows it.

How can I make this work using regular expressions?

Ivan Bravo Carlos
  • Not well... can you get your data-supplier to fix the files; it would save you a lot of bother in the longer run. – Ben Dec 27 '12 at 10:26
  • The data comes from an 8-year-old Excel contact directory; there are about 200 files of it :S – Ivan Bravo Carlos Dec 27 '12 at 10:32
  • Well, putting aside the arguable pattern, its part `@[\._a-zA-Z0-9-]+` cannot possibly catch @domain.comwww.domain.com unless that text is present as is in the input. And `domain.comwww.domain.com` is a legal domain in the .com zone. Do all of your "false" captures look like "domain-without-www-followed-by-domain-with-www"? – Max Yakimets Dec 27 '12 at 10:41
  • No. About 15% of the input has the URL attached. I've run some tests with 10 different files and this anomaly is present in all of them. What I notice is that every URL starts with www., but I don't know whether some start with http or https too, which I believe they do – Ivan Bravo Carlos Dec 27 '12 at 10:54
  • I'm thinking of exploding the regex result into 2 strings, one with the actual email and the other with the junk – Ivan Bravo Carlos Dec 27 '12 at 10:58
  • Is your task to extract the emails once and ditch the Excel files? If so, a manual walkthrough, fixing each erroneous pattern, would be OK; it would be way easier than inventing an AI to decide which email is OK :) – Max Yakimets Dec 27 '12 at 11:02
  • Well, my job is to create a simple tool that lets users upload a txt file with a bunch of contact info and get only the emails for their newsletter database. They will keep using Excel as a db engine. Why? I don't know, but they don't want to do that kind of work. I've exploded the string and this works for now, until I find a more elegant approach with regex – Ivan Bravo Carlos Dec 27 '12 at 11:10

1 Answer

List the allowed top-level domains explicitly in the last group of the regex, with a two-letter alternative that covers country-code TLDs, like this (the pattern must be matched case-insensitively):

[A-Z0-9._%+-]+@[A-Z0-9.-]+\.(?:[A-Z]{2}|com|org|net|edu|gov|mil|biz|info|mobi|name|aero|asia|jobs|museum)

You can read more about email validation here or read the related question here.
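As a rough sketch of how a TLD list like the one above could be applied to the "glued URL" input from the question: as written, the domain part `[A-Z0-9.-]+` is greedy, so it can run on through the attached URL. The variant below makes that part lazy, tries the longer TLDs before the two-letter alternative, and accepts a TLD only when it is followed by `www.`, `http:`/`https:`, a character that cannot continue a domain, or the end of the input. The function name `extractEmails` and the lookahead are illustrative assumptions, not part of the original answer, and the assumption that attached URLs always begin with `www.` or `http` comes from the comments on the question.

```php
<?php
// Hypothetical helper (not from the original answer): extract e-mail
// addresses from raw text, stopping at a known TLD even when a URL
// has been glued onto the end of the address.
function extractEmails(string $text): array
{
    // Longer TLDs come first so that "com" is tried before the
    // two-letter country-code alternative.
    $tld = 'museum|mobi|name|aero|asia|jobs|info|com|org|net|edu|gov|mil|biz|[a-z]{2}';

    // The domain part is lazy (+?), so matching stops at the first TLD
    // that passes the lookahead. The lookahead requires "www.", "http:"
    // or "https:", a non-domain character, or the end of the input.
    $pattern = '/[a-z0-9._%+-]+@[a-z0-9.-]+?\.(?:' . $tld . ')'
             . '(?=www\.|https?:|[^a-z0-9.-]|$)/i';

    preg_match_all($pattern, $text, $matches);

    // $matches[0] holds the full matches, one per address found.
    return $matches[0];
}
```

Calling `extractEmails($string)` on the contents of an uploaded file returns a flat array of addresses. An address immediately followed by a period (for example at the end of a sentence) is not matched by this sketch, which should rarely matter for tab-delimited exports.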

UPDATE

An expression that follows the RFC 2822 address syntax more closely, with the same explicit TLD list:

[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*@(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+(?:[a-zA-Z]{2}|com|org|net|edu|gov|mil|biz|info|mobi|name|aero|asia|jobs|museum)
Paul T. Rawkeen