// get CONTENT from united domains footer
$content = file_get_contents('http://www.uniteddomains.com/index/footer/');

// remove spaces from CONTENT
$content = preg_replace('/\s+/', '', $content);

// match all tld tags
$regex = '#target="_parent">.(.*?)</a></li><li>#';
preg_match($regex, $source, $matches);


print_r($matches);

I want to match all of the TLDs:

Each TLD is preceded by target="_parent">. (including the literal dot) and followed by </a></li><li>

I want to end up with an array like array('africa', 'amsterdam', 'bcn', ... etc.)

What am I doing wrong here?

NOTE: The second step, removing all the whitespace, is just there to simplify things.
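
For reference, here is a minimal corrected sketch of the snippet above, assuming the intent is to capture every list item: it matches against $content rather than the undefined $source, uses preg_match_all instead of preg_match, and escapes the literal dot in the pattern.

$content = file_get_contents('http://www.uniteddomains.com/index/footer/');
$content = preg_replace('/\s+/', '', $content);

// capture everything between target="_parent">. and </a></li><li>
preg_match_all('#target="_parent">\.(.*?)</a></li><li>#', $content, $matches);

// $matches[1] should then hold the names without the leading dot
print_r($matches[1]);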

Perry
user1512405
    This is still HTML parsing which should be done with [an appropriate HTML parser](http://stackoverflow.com/q/3577641/53114) and not regular expressions. – Gumbo Jul 28 '13 at 20:00
  • It is not HTML parsing, it is finding a particular pattern in a string that happens to be HTML. – Daniel Gimenez Jul 28 '13 at 20:06

2 Answers


Here's a regular expression that will do it for that page:

\.\w+(?=</a></li>)

The lookahead (?=</a></li>) asserts that the closing tags follow each dotted TLD without consuming them, so only the .tld text itself is matched.


PHP

// fetch the footer markup and grab every ".tld" that is immediately followed by </a></li>
$content = file_get_contents('http://www.uniteddomains.com/index/footer/');
preg_match_all('/\.\w+(?=<\/a><\/li>)/m', $content, $matches);
print_r($matches);

PHPFiddle

Here are the results:

.africa, .amsterdam, .bcn, .berlin, .boston, .brussels, .budapest, .gent, .hamburg, .koeln, .london, .madrid, .melbourne, .moscow, .miami, .nagoya, .nyc, .okinawa, .osaka, .paris, .quebec, .roma, .ryukyu, .stockholm, .sydney, .tokyo, .vegas, .wien, .yokohama, .africa, .arab, .bayern, .bzh, .cymru, .kiwi, .lat, .scot, .vlaanderen, .wales, .app, .blog, .chat, .cloud, .digital, .email, .mobile, .online, .site, .mls, .secure, .web, .wiki, .associates, .business, .car, .careers, .contractors, .clothing, .design, .equipment, .estate, .gallery, .graphics, .hotel, .immo, .investments, .law, .management, .media, .money, .solutions, .sucks, .taxi, .trade, .archi, .adult, .bio, .center, .city, .club, .cool, .date, .earth, .energy, .family, .free, .green, .live, .lol, .love, .med, .ngo, .news, .phone, .pictures, .radio, .reviews, .rip, .team, .technology, .today, .voting, .buy, .deal, .luxe, .sale, .shop, .shopping, .store, .eus, .gay, .eco, .hiv, .irish, .one, .pics, .porn, .sex, .singles, .vin, .vip, .bar, .pizza, .wine, .bike, .book, .holiday, .horse, .film, .music, .party, .email, .pets, .play, .rocks, .rugby, .ski, .sport, .surf, .tour, .video
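
If the bare names from the question (array('africa', 'amsterdam', ...)) are wanted rather than the dotted forms, one way is to strip the leading dot from each match. This is just a sketch on top of the code above; it assumes $matches[0] holds the full matches from preg_match_all, and $tlds is an illustrative name:

// drop the leading dot from each matched TLD
$tlds = array_map(function ($tld) { return ltrim($tld, '.'); }, $matches[0]);
print_r($tlds);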

Daniel Gimenez

Using the DOM is cleaner:

$doc = new DOMDocument();
@$doc->loadHTMLFile('http://www.uniteddomains.com/index/footer/');
$xpath = new DOMXPath($doc);
$items = $xpath->query('/html/body/div/ul/li/ul/li[not(@class)]/a[@target="_parent"]/text()');

$result = '';
foreach ($items as $item) {
    $result .= $item->nodeValue;
}
$result = explode('.', $result);
array_shift($result); // the first element is empty because the string starts with a dot
print_r($result);
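
For illustration, a variant sketch that builds one array entry per link directly from the node list instead of concatenating and exploding; $tlds is just an illustrative name:

$tlds = array();
foreach ($items as $item) {
    // each text node looks like ".africa"; trim whitespace and drop the leading dot
    $tlds[] = ltrim(trim($item->nodeValue), '.');
}
print_r($tlds);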
Casimir et Hippolyte
  • How would I make this match only lowercase? Using that exact code it also pulls "Geographic & Travel" and other header text. – user1512405 Jul 28 '13 at 20:11