1

I grabbed the HTML from this URL: http://facebook.com/zuck. There is no problem echoing it to the client browser, but I found it impossible to parse with PHP.

I am trying to parse the text inside div tags, for example:

preg_match_all("/<div class=\"mediaPageName\">(.*)<\/div>/",$html,$matches);
print_r($matches);

This returns an empty array. I also tried DOMDocument and PHP Simple HTML DOM Parser; both return empty elements and can't grab the text from the HTML.

How is that even possible? Is there a solution?

Ben
  • 3
    Scraping is probably a ToS violation. You're better off with graph API when possible: http://graph.facebook.com/zuck – Frank Farmer Jun 01 '11 at 21:59
  • 1
    @Frank Farmer Maybe true, but it is ironic worrying about ToS violations with Facebook, considering all of their privacy issues. – brian_d Jun 01 '11 at 22:02
  • 1
    Is he trying to [parse HTML using a regex](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454)? – Stephen P Jun 01 '11 at 22:45

3 Answers

3

It is quite possible.

The easiest way is to load the complete DOM into DOMDocument or phpQuery.
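
As a rough illustration of the DOMDocument route, here is a minimal sketch. It assumes the markup is present as ordinary HTML; the `$html` sample string is hypothetical and only stands in for the fetched page.

```php
<?php
// Hypothetical sample markup, standing in for the fetched page.
$html = '<html><body><div class="mediaPageName">Nirvana</div></body></html>';

$doc = new DOMDocument();
// Real-world pages are rarely valid; collect parser warnings instead of printing them.
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

// Query every div whose class attribute is exactly "mediaPageName".
$xpath = new DOMXPath($doc);
$names = array();
foreach ($xpath->query('//div[@class="mediaPageName"]') as $node) {
    $names[] = trim($node->textContent);
}
print_r($names);
```

The XPath test `@class="mediaPageName"` only matches when that is the whole class attribute; an element with several classes would need a `contains()` check instead.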

Edit:

Looking at the source of the linked page, the markup for the element you are searching for has its less-than characters (<) replaced with the Unicode escape \u003c.

Example: \u003cdiv class=\"mediaPageName\">Nirvana\u003c\/div>
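
To show what that escaping amounts to, here is a minimal sketch that decodes the example fragment by treating it as the body of a JSON string literal. Wrapping the fragment in double quotes before calling `json_decode` is an assumption: it only works when the fragment is valid JSON string content, which may not hold for an arbitrary slice of the page.

```php
<?php
// The escaped fragment from the example above, kept literal by single quotes.
$fragment = '\u003cdiv class=\"mediaPageName\">Nirvana\u003c\/div>';

// Assumption: the fragment is valid JSON string content, so adding the
// surrounding quotes yields a JSON string that json_decode can unescape.
$decoded = json_decode('"' . $fragment . '"');

echo $decoded, "\n"; // <div class="mediaPageName">Nirvana</div>
```

If `json_decode` returns `null`, the fragment was not valid JSON string content and a different approach is needed.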

Edit 2:
As others have mentioned, avoid parsing HTML with a regex when it is not necessary. In this case, though, it looks like it is required, as Frank Farmer mentions.

This regex will find some matches (only one per line; hopefully someone can adjust it to get all the matches):

preg_match_all('%\\\\u003cdiv class=.*mediaPageName[^>]*>([^>]*)\\\\u003c%i', $html, $matches);
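
One possible adjustment, sketched under the assumption that every target element is escaped exactly like the example above (`\u003c`, `\"` and `\/`): spell out the escaped quotes and the closing tag, and use a lazy capture so several elements on one line are matched separately. The `$html` sample is hypothetical.

```php
<?php
// Hypothetical sample: two escaped elements on a single line.
$html = 'x \u003cdiv class=\"mediaPageName\">Nirvana\u003c\/div> y '
      . '\u003cdiv class=\"mediaPageName\">Minimalism\u003c\/div> z';

// In a single-quoted PHP string, \\\\ becomes \\ in the pattern, which
// matches one literal backslash. (.*?) is lazy, so each match stops at the
// first escaped closing tag instead of running to the last one on the line.
$pattern = '%\\\\u003cdiv class=\\\\"mediaPageName\\\\"[^>]*>(.*?)\\\\u003c\\\\/div>%i';

preg_match_all($pattern, $html, $matches);
print_r($matches[1]);
```

This is stricter than the pattern above: it requires the class attribute to be exactly `mediaPageName` and will miss elements whose markup is escaped differently.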

It may be worthwhile finding out how to use Unicode regex as outlined here.

brian_d
  • @Brad F Jacobs I see that, but it does not mean that it is not the correct solution - it could easily be done with DOMDocument. – brian_d Jun 01 '11 at 22:04
  • @brian_d Even when I do only `preg_match('/mediaPageName/',$html)` it returns false, but I checked with echo and I can see the div element with the class mediaPageName with my own eyes. I don't understand. – Ben Jun 01 '11 at 22:07
  • @brian_d I tried to load it into DOMDocument; it is not working. – Ben Jun 01 '11 at 22:09
  • @Ben preg_match should return a number, and false only on an error. http://php.net/manual/en/function.preg-match.php. You need pattern delimiters, e.g. '/mediaPageName/' – brian_d Jun 01 '11 at 22:10
  • @Ben to make sure, is `var_dump(preg_match('/mediaPageName/', $html))` 0 or false? And are you sure `$html` is not empty? – brian_d Jun 01 '11 at 22:13
  • @brian_d `preg_match_all('/<div class="mediaPageName">(.*)<\/div>/', $html,$matches); var_dump($matches);` returns `array(2) { [0]=> array(0) { } [1]=> array(0) { } }` and `echo $html` returns all the HTML correctly – Ben Jun 01 '11 at 22:21
  • 1
    @Ben look at my updated answer, it looks like the divs you are looking for have the `<` signs output in unicode `\u003c` – brian_d Jun 01 '11 at 22:37
  • @brian_d You're right! So what is the correct regex to grab it? I tried `preg_match_all('/\\u003cdiv class=\\"mediaPageName\\">(.*)\\u003c\\\/div>/', $html,$matches); var_dump($matches);` but it is not working. – Ben Jun 01 '11 at 22:52
  • The unicode `\u003c` issue is related to the fact that the markup in question is actually part of what looks like a JSON-encoded string wrapped in a – Frank Farmer Jun 01 '11 at 23:03
  • @Ben I have added another edit with a partial solution - it gets some results, but misses others. – brian_d Jun 01 '11 at 23:52
  • @brian_d I figured it out. I just convert the Unicode escapes to HTML with this code: `$html = str_replace(array('\u003c','\"','\/'), array('<','"','/'), $html);`. Then $html contains `<div class="mediaPageName">Minimalism</div>`, and on this data I run `preg_match_all('/<div class="mediaPageName">(.*?)<\/div>/', $html, $matches); var_dump($matches);` and I'm done. This is my best solution. I know there must be a way to do it with a single preg_match line; I will work on it. Thanks a lot! The next step is to preg_match the other divs by figuring out the structure of all the elements in the page. – Ben Jun 02 '11 at 15:16
2

You're probably going to be much better off in the long run if you just use the Graph API. The profile picture and some basic account information are public and require no authentication or authorization. Just issue a request for http://graph.facebook.com/zuck/picture, for example.

bensnider
  • I got it, but I can't get all the data that I need from the Graph API. Is there a solution to my problem? – Ben Jun 01 '11 at 22:11
  • 1
    The solution is to say "oh well". Screen scraping Facebook is going to end in tears eventually. – ceejayoz Jun 02 '11 at 02:23
1
$html = str_replace(array('\u003c','\"','\/'), array('<','"','/'), $html);
preg_match_all('/<div class=\"mediaPageName\">(.*?)<\/div>/', $html, $matches); 
var_dump($matches);

There must be a way to do this with a single preg_match call instead of the code above, and also to grab this tag: <span class="fwb">text</span>, but I don't know how to write it in a single line.
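
One way this might look, as a sketch only: keep the `str_replace` step, then use a single `preg_match_all` with an alternation that covers both tags. The `$raw` sample is hypothetical, and the pattern assumes each element carries exactly one class attribute and no nested tags of the same name.

```php
<?php
// Hypothetical escaped input containing both kinds of element.
$raw = '\u003cdiv class=\"mediaPageName\">Minimalism\u003c\/div>'
     . '\u003cspan class=\"fwb\">text\u003c\/span>';

// Same unescaping step as in the code above.
$html = str_replace(array('\u003c', '\"', '\/'), array('<', '"', '/'), $raw);

// Group 1 captures the tag name, and \1 forces the closing tag to match it.
// Group 2 is the lazily captured text between the tags.
preg_match_all('~<(div|span) class="(?:mediaPageName|fwb)">(.*?)</\1>~', $html, $matches);
print_r($matches[2]);
```

Note that the alternation is loose: it would also accept `<span class="mediaPageName">` or `<div class="fwb">`. Two separate branches would be needed to rule those out.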

Ben