I'm trying to scrape a website using regex, but the site isn't written in well-formed HTML. In fact, the HTML is horrible and hardly structured at all. I've managed to handle most of it. The problem I'm running into now is that in some email addresses, a span is wrapped around an arbitrary part of the address, like so:

****.*******@g<span class="tournamenttext">mail.com</span>
************<span class="tournamenttext">@yahoo.com</span>
<span class="tournamenttext">**********@mail.com</span>
*******@gmail.com

Is there a way to retrieve the emails with all this inconsistency?

LordZardeck
  • Where is this text stored: in a PHP file, a text file, or a database? Please be more specific about this. – Rafee Nov 18 '11 at 07:44
  • i'm scraping from a website like I said. I have no idea whether it's stored as static html or in a database. i assume static html since there's so much inconsistency – LordZardeck Nov 18 '11 at 07:52
  • If the span tags are wrapped around *randomly* then that's most likely intended to aggravate email address harvesting. – mario Nov 18 '11 at 07:55
  • 1) it's not entirely random. only a small number of them are like that, and the unstructured tags are not just with the emails. 2) i have permission to scrape the website so i'm not harvesting the emails – LordZardeck Nov 18 '11 at 08:06
  • possible duplicate of [RegEx match open tags except XHTML self-contained tags](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) – outis Nov 18 '11 at 08:07

2 Answers

You could simply remove all span tags by replacing </?span[^>]*> with nothing and try your favourite email address finder on the result.
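
A minimal sketch of that approach in PHP, assuming the scraped text is already available as a string. The function name `extract_emails` and the deliberately loose email pattern are illustrative assumptions, not something from the site or a library; the pattern allows `*` only so that it matches the masked sample addresses from the question.

```php
<?php
// Hypothetical helper: strip <span> tags, then collect everything that looks like an email.
function extract_emails(string $html): array
{
    // Remove opening and closing span tags, keeping the text between them.
    $clean = preg_replace('/<\/?span[^>]*>/i', '', $html);

    // Loose, illustrative pattern; '*' is allowed only because the sample data is masked.
    preg_match_all('/[a-z0-9._%+*-]+@[a-z0-9-]+(?:\.[a-z0-9-]+)+/i', $clean, $matches);

    return $matches[0];
}

$sample = '****.*******@g<span class="tournamenttext">mail.com</span>' . "\n"
        . '************<span class="tournamenttext">@yahoo.com</span>' . "\n"
        . '<span class="tournamenttext">**********@mail.com</span>' . "\n"
        . '*******@gmail.com';

print_r(extract_emails($sample));
```

Using preg_match_all rather than preg_match returns every address in the text instead of only the first one. A stricter pattern may be preferable on real, unmasked data.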

Jens
$string ='****.*******@g<span class="tournamenttext">mail.com</span>
************<span class="tournamenttext">@yahoo.com</span>
<span class="tournamenttext">**********@mail.com</span>
*******@gmail.com';

$pattern = "/<\/?span[^>]*>/";
$string = preg_replace($pattern, "", $string);

After that, $string will contain only the email addresses:

****.*******@gmail.com
************@yahoo.com
**********@mail.com
*******@gmail.com

Your code would then look something like this:

// $text[1]->innertext contains something like:
// "<em>Local (Open) Tournament.</em> ****.*******@g<span class="tournamenttext">mail.com</span>"

// Firstly clear spans
$pattern = "/<\/?span[^>]*>/";
$text[1]->innertext = preg_replace($pattern, "", $text[1]->innertext);

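// Then match the email address.
// Just an example email regex. It needs delimiters, and it must not be anchored
// with ^ and $, because the address sits inside surrounding text.
$email_regex = "/[_a-z0-9-]+(\.[_a-z0-9-]+)*@[a-z0-9-]+(\.[a-z0-9-]+)*(\.[a-z]{2,3})/i";
preg_match($email_regex, $text[1]->innertext, $theMatch);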
echo '<pre>' . print_r($theMatch, true) . '</pre>'; 
Utku Yıldırım
  • that looks like what i'd want, but is there any way to do the same thing using the class? that way i wouldn't remove any unnecessary code? – LordZardeck Nov 18 '11 at 07:53
  • create a clear function in class like private function clear($string) { $pattern = "/<\/?span[^>]*>/"; return preg_replace($pattern, "", $string); } – Utku Yıldırım Nov 18 '11 at 08:13
  • if I preg_match, i get an empty array – LordZardeck Nov 18 '11 at 08:23
  • preg_match("/<\/?span[^>]*>/", $text[1]->innertext, $theMatch); echo '<pre>' . print_r($theMatch, true) . '</pre>'; Where innertext contains something like: "Local (Open) Tournament. ****.*******@gmail.com. Epic Systems 1979" – LordZardeck Nov 18 '11 at 08:52
  • ok, you were right in the first place. I was mistaken. I was supposed to use $item->innertext instead of $text[1]->innertext. The $text variable didn't contain it. Thanks! – LordZardeck Nov 18 '11 at 20:03
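
On the question raised in the comments about targeting only spans with a particular class: one possible sketch, assuming those spans are never nested and the class attribute is written exactly as in the samples, is to unwrap just those spans and leave all other markup alone. The function name `unwrap_spans_by_class` and the extra `other` span in the sample are hypothetical.

```php
<?php
// Hypothetical helper: unwrap only <span class="..."> elements with the given class,
// keeping their inner text. Assumes such spans are not nested inside each other.
function unwrap_spans_by_class(string $html, string $class): string
{
    $pattern = '/<span\s+class="' . preg_quote($class, '/') . '"\s*>(.*?)<\/span>/is';

    return preg_replace($pattern, '$1', $html);
}

$html = '<em>Local (Open) Tournament.</em> ****.*******@g<span class="tournamenttext">mail.com</span>'
      . ' <span class="other">kept</span>';

echo unwrap_spans_by_class($html, 'tournamenttext');
```

Because the closing tag is matched together with its opening tag, spans with other classes keep both of their tags. This would not hold up if the targeted spans contain nested spans.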