I am trying to fetch the latest "What's hot" topics from Alexa for my son's research project. I basically just want to grab the words and insert them into a MySQL database.

What I currently have is:

<?php
function getAlexa($url)
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/5.0");
    curl_setopt($ch, CURLOPT_HEADER, 0);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
    $data = curl_exec($ch);
    curl_close($ch);
    return $data;
}
$grab = getAlexa("http://www.alexa.com/whatshot");

// missing part to get everything between title=''

// mysql connection details are included
include "connectiondetails.php";

// mysql table setup is id (int 11) AI, word (varchar 100) UNIQUE
$insert = @mysql_query("INSERT IGNORE INTO words values('','$word')");

?>

My problem is that I am still new to PHP. I need to grab all the title attributes of the a href links on alexa.com/whatshot (not the anchor text, as it's shortened), that is, everything between title=' and the next '. For example, title='hello world' means the word (string) would be hello world. I need to grab all 20 words this way.

The structure is:

<a href='http://www.alexa.com/whatshot?q=loretta+swit+turns+75' title='Loretta Swit Turns 75'>Loretta Swit Turns 75</a></li><li>2. <a href='http://www.alexa.com/whatshot?q=where+do+i+vote' title='where do I vote'>where do I vote</a></li><li>3. <a href='http://www.alexa.com/whatshot?q=nj+earthquake' title='NJ earthquake'>NJ earthquake</a></li><li>4. <a href='http://www.alexa.com/whatshot?q=yvonne+strahovski' title='Yvonne Strahovski'>Yvonne Strahovski</a></li><li>5. <a href='http://www.alexa.com/whatshot?q=early+voting+results' title='early voting results'>early voting results</a></li></ul><ul class='hotsearches' start='6'><li>6. <a href='http://www.alexa.com/whatshot?q=milt+campbell+dies' title='Milt Campbell Dies'>Milt Campbell Dies</a></li><li>7. <a href='http://www.alexa.com/whatshot?q=bristol+palin+suit+tossed' title='Bristol Palin suit tossed'>Bristol Palin suit...</a></li><li>8. <a href='http://www.alexa.com/whatshot?q=a+gay+lesbian' title='a gay lesbian'>a gay lesbian</a></li><li>9. <a href='http://www.alexa.com/whatshot?q=navy+skipper+fired' title='Navy skipper fired'>Navy skipper fired</a></li><li>10. <a href='http://www.alexa.com/whatshot?q=single+mom+no+tip' title='single mom no tip'>single mom no tip</a></li></ul><ul class='hotsearches' start='11'><li>11. <a href='http://www.alexa.com/whatshot?q=craigslist' title='craigslist'>craigslist</a></li><li>12. <a href='http://www.alexa.com/whatshot?q=nate+silver' title='Nate Silver'>Nate Silver</a></li><li>13. <a href='http://www.alexa.com/whatshot?q=real+clear+politics' title='real Clear Politics'>real Clear Politics</a></li><li>14. <a href='http://www.alexa.com/whatshot?q=93-year-old+bodybuilder' title='93-year-old bodybuilder'>93-year-old bodybu...</a></li><li>15. <a href='http://www.alexa.com/whatshot?q=wreck+it+ralph' title='Wreck It Ralph'>Wreck It Ralph</a></li></ul><ul class='hotsearches' start='16'><li>16. 
<a href='http://www.alexa.com/whatshot?q=kickstarter' title='Kickstarter'>Kickstarter</a></li><li>17. <a href='http://www.alexa.com/whatshot?q=african+painted+dogs' title='African painted dogs'>African painted dogs</a></li><li>18. <a href='http://www.alexa.com/whatshot?q=red+dawn' title='Red Dawn'>Red Dawn</a></li><li>19. <a href='http://www.alexa.com/whatshot?q=instagram' title='Instagram'>Instagram</a></li><li>20. <a href='http://www.alexa.com/whatshot?q=iphone+5' title='iPhone 5'>iPhone 5</a>

So if all goes well, I would have 20 words in my database once the grab is done.

Thank you for taking time to read this.

Your help is greatly appreciated.

Marcus Weller
  • Obligatory reference: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags – Paystey Nov 05 '12 at 15:00

2 Answers

You can use the following regex to parse all of the title attributes/values from a given input:

title=(?:(?:"([^"]+)")|(?:'([^']+)'))

To use it you can use PHP's preg_match_all() (this assumes your HTML is in your $grab variable):

$titles = array();
preg_match_all('/title=(?:(?:"([^"]+)")|(?:\'([^\']+)\'))/i', $grab, $titles);

This regex matches any title value enclosed in double or single quotes. From your sample output, it looks as though they are all in single quotes. The captured values in the $titles array are split into two groups: $titles[1] contains the titles that were in double quotes, and $titles[2] contains the titles that were in single quotes. Each of the two arrays has one entry per match, with an empty string wherever the other quote style matched.

You could merge them together with array_merge() and then array_filter() to get rid of the empty values. Then, you can iterate through them as normal:

$titles = array_filter(array_merge($titles[1], $titles[2]));
foreach ($titles as $title) {
    // do whatever you need/want with $title
}
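Note that array_merge() lists all double-quoted titles before all single-quoted ones, so the original page order is lost when both quote styles occur. If the order matters, a minimal sketch along these lines could be used instead. It assumes the same regex as above but swaps in the PREG_SET_ORDER flag; extractTitles() and the $sample string are hypothetical names made up for illustration:

```php
<?php
// Hypothetical helper (not part of the answer above): returns the title
// values found in $html, in document order, with duplicates removed.
function extractTitles($html)
{
    $matches = array();
    // PREG_SET_ORDER yields one array per match instead of one array per group.
    preg_match_all('/title=(?:(?:"([^"]+)")|(?:\'([^\']+)\'))/i', $html, $matches, PREG_SET_ORDER);

    $titles = array();
    foreach ($matches as $match) {
        // Group 1 holds a double-quoted value, group 2 a single-quoted one.
        // Group 2 is absent when the double-quoted branch matched.
        $titles[] = (isset($match[2]) && $match[2] !== '') ? $match[2] : $match[1];
    }
    return array_values(array_unique($titles));
}

// Made-up sample mixing both quote styles.
$sample = "<a href='http://www.alexa.com/whatshot?q=nj+earthquake' title='NJ earthquake'>NJ earthquake</a>"
        . '<a href="#" title="Red Dawn">Red Dawn</a>';
$result = extractTitles($sample);
```

Removing duplicates with array_unique() is optional here, since the UNIQUE column plus INSERT IGNORE in the question would also skip repeated words.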

UPDATE (escaped quotes inside of titles)
Per a comment, I've realized that my original regex won't match titles that have escaped quotes within them (such as title="foo \"bar\""). The following regex (ported from this answer) should handle this:

title=(?:(?:"([^"\\]*(?:\\.[^"\\]*)*)")|(?:\'([^'\\]*(?:\\.[^'\\]*)*)\'))

To use this in preg_match_all(), you would use:

preg_match_all('/title=(?:(?:"([^"\\\\]*(?:\\\\.[^"\\\\]*)*)")|(?:\'([^\'\\\\]*(?:\\\\.[^\'\\\\]*)*)\'))/i', $grab, $titles);

Though it's much longer and probably impossible to read without an explanation (if you need one, let me know and I'll post it), it'll definitely be worth it if you start noticing missing data because of escaped quotes!
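As a rough illustration of the longer pattern, the sketch below runs it against a made-up input containing backslash-escaped quotes. The $sample string is hypothetical and not taken from the Alexa page, and the stripslashes() step assumes that a backslash in the markup is only ever used to escape the following character:

```php
<?php
// Same pattern as in the preg_match_all() call above.
$pattern = '/title=(?:(?:"([^"\\\\]*(?:\\\\.[^"\\\\]*)*)")|(?:\'([^\'\\\\]*(?:\\\\.[^\'\\\\]*)*)\'))/i';

// Hypothetical input; a nowdoc keeps the backslashes exactly as written.
$sample = <<<'HTML'
<a title="foo \"bar\"">x</a> <a title='it\'s hot'>y</a>
HTML;

$matches = array();
preg_match_all($pattern, $sample, $matches);

// Merge both groups, drop the empty placeholders, and reindex.
$raw = array_values(array_filter(array_merge($matches[1], $matches[2])));

// Assumption: remove the escaping backslashes to get the plain text.
$titles = array_map('stripslashes', $raw);
```

Before the stripslashes() step, the captured values still contain the backslashes, so whether to remove them depends on how the data will be stored.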

newfurniturey

You want this: http://www.php.net/manual/en/function.preg-match-all.php

And also this: http://www.regular-expressions.info/reference.html

Assuming you have the regex part handled, you'll do something like this:

$titles = array();
// The "@" is just a regex delimiter; modifiers such as "i" go directly after the closing delimiter
preg_match_all('@YOUR_REGEX_HERE@i', $grab, $titles);
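For the markup shown in the question, where every title is in single quotes, one possible pattern to put in place of the placeholder is sketched below. The regex itself and the shortened $grab sample are assumptions for illustration; the real page may need a stricter pattern:

```php
<?php
// Hypothetical sample in the same shape as the markup from the question.
$grab = "<li>1. <a href='http://www.alexa.com/whatshot?q=red+dawn' title='Red Dawn'>Red Dawn</a></li>"
      . "<li>2. <a href='http://www.alexa.com/whatshot?q=iphone+5' title='iPhone 5'>iPhone 5</a></li>";

$titles = array();
// "@" is the delimiter; the "i" modifier follows the closing delimiter directly.
$count = preg_match_all("@title='([^']*)'@i", $grab, $titles);

// $titles[0] holds the full matches, $titles[1] the text between the quotes.
$words = $titles[1];
```

preg_match_all() returns the number of full matches, so $count can be checked against the expected 20 entries before anything is written to the database.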
Matthemattics