I am trying to scrape the bio and pictures from my about.me profile (http://about.me/fernandocaldas) for my personal web page, so that whenever I change that profile, the bio on my site changes too. The values I want are between

    <script type="text/json" class="json user" data-scope="view_profile" data-lowercase_user_name="fernandocaldas">

and

    </script>

Here is my code:

    $thtml = file_get_contents('http://about.me/fernandocaldas');
    $matchval = '/\<script type=\"text\/json\" class=\"json.*?>(.*?)\<\/script\>/i';
    preg_match($matchval, $thtml, $match);
    var_dump($match);
    if($match){
        echo "match!\n";
        foreach($match[1] as $val)
        {
            echo $val."<br>";
        }
    }

But the result is always array(0) {} for the var_dump.

  • 2
    use DOMDocument and DOMXPath – Casimir et Hippolyte Mar 29 '16 at 22:53
  • 1
    I don't really understand why you would expect this to work. The regular expression you are using doesn't even look vaguely like it would match the HTML you have given. In any case I second @CasimiretHippolyte's suggestion of parsing the HTML properly. – Chris Mar 29 '16 at 23:13
  • Revisiting this, I was not even near with the regex. Thanks for your time and answers. – Fernando CR Oct 12 '17 at 04:53

1 Answer

1

Regular expressions are never a good idea for HTML: today a regex seems to work, but tomorrow it will fail! [1]

Programmers often think: “Why should I initialize a parser, load the HTML, and run several queries when I can do it all with one line of regex?” The answer is: “Why take the road that leads in the wrong direction just because it is shorter?”

In your case, using a parser can also shorten your code.

First, fetch your HTML page, create a new DOMDocument object, load the HTML string into it, and create a DOMXPath object (DOMXPath lets you run complex queries against the HTML):

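    $html = file_get_contents( 'http://about.me/fernandocaldas' );  // fetch the page as a string
    $dom = new DOMDocument();
    libxml_use_internal_errors( true );  // suppress warnings from imperfect HTML
    $dom->loadHTML( $html );
    $xpath = new DOMXPath( $dom );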

Search for the element(s) with tag <script> and class “json user”:

    $found = $xpath->query( '//script[@class="json user"]' );
    if( !$found->length ) die( 'Error retrieving JSON' );
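Note that `@class="json user"` compares the whole attribute string, so it stops matching if the page ever lists the classes in another order or adds a third one. A minimal sketch of a token-based alternative, run against made-up markup (the sample HTML and the `$hasToken` helper string are assumptions for illustration, not taken from the real page):

```php
<?php
// Hypothetical sample markup; the class tokens are deliberately reordered.
$html = '<html><body>'
      . '<script type="text/json" class="user json">{"first_name":"Ada"}</script>'
      . '</body></html>';

$dom = new DOMDocument();
libxml_use_internal_errors(true);   // suppress warnings from imperfect HTML
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Exact comparison of the attribute string: no match for "user json".
$exact = $xpath->query('//script[@class="json user"]');

// Token-based test: true when the given word appears as a separate class token.
$hasToken = 'contains(concat(" ", normalize-space(@class), " "), " %s ")';
$tolerant = $xpath->query(
    '//script[' . sprintf($hasToken, 'json') . ' and ' . sprintf($hasToken, 'user') . ']'
);

echo $exact->length, "\n";     // 0
echo $tolerant->length, "\n";  // 1
```

The longer expression is the usual XPath 1.0 idiom for "has this class"; for a page whose markup you control, the exact comparison may be perfectly adequate.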

Put the node value of the first (and only, in your page) node in a variable (I also trim it, though that is unnecessary) and decode it with json_decode():

    $json = trim( $found->item(0)->nodeValue );
    $user = json_decode( $json );

Now the $user object holds all the data you need: $user->first_name contains your first name and $user->bio contains your biography. Calling print_r( $user ) displays the complete $user structure, so you can see how to access each element.
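Put together, the steps above might look like the sketch below. It runs against an inline placeholder page rather than the live profile, so the field values are invented; `extractProfile()` is a hypothetical helper name, and unlike the snippets above it decodes to an associative array and checks `json_last_error()`.

```php
<?php
/**
 * Hypothetical helper: returns the decoded profile as an array,
 * or null when the <script> element or valid JSON is missing.
 */
function extractProfile(string $html): ?array
{
    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true);  // silence HTML warnings
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);         // restore the old setting

    $xpath = new DOMXPath($dom);
    $found = $xpath->query('//script[@class="json user"]');
    if ($found === false || $found->length === 0) {
        return null;
    }

    // Second argument true: decode JSON objects into associative arrays.
    $data = json_decode(trim($found->item(0)->nodeValue), true);
    return (json_last_error() === JSON_ERROR_NONE && is_array($data)) ? $data : null;
}

// Placeholder page standing in for the real profile HTML.
$sample = '<html><body>'
        . '<script type="text/json" class="json user" data-scope="view_profile">'
        . '{"first_name": "Fernando", "bio": "Placeholder biography"}'
        . '</script></body></html>';

$profile = extractProfile($sample);
echo $profile['first_name'], "\n";  // Fernando
echo $profile['bio'], "\n";         // Placeholder biography

// A page without the script element yields null.
var_dump(extractProfile('<html><body><p>no profile here</p></body></html>'));  // NULL
```

For the live page you would pass the result of `file_get_contents()` instead of `$sample`; returning null rather than calling `die()` is only a design choice that lets the caller decide how to handle a missing profile.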



[1] If the HTML structure changes, a parser will fail too.

fusion3k
  • *"today regex seems to work, but tomorrow they will fail!"*: Note that this can also be the case with DOMDocument. – Casimir et Hippolyte Mar 30 '16 at 00:52
  • @CasimiretHippolyte Yes, totally right. And with any other parser also. – fusion3k Mar 30 '16 at 09:11
  • You can't expect to write code by copy-and-paste. Also, if your code doesn't work, you have to check for your own errors before throwing in the towel. I test my code before answering, and the identical code above (on http://about.me/fernandocaldas, as per your question) worked yesterday. Today it works again: see [this demo](http://phpfiddle.org/main/code/s5mn-30kd) – fusion3k Mar 31 '16 at 00:32
  • Forgive me for the vague comment. It was my big mistake to use readfile() instead of file_get_contents() to retrieve the data. Many thanks for helping me. – Fernando CR Apr 02 '16 at 05:07
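
The mix-up described in the last comment is easy to reproduce: readfile() sends the content straight to the output and returns the number of bytes read, whereas file_get_contents() returns the content as a string. A small sketch using a temporary file in place of the remote page (the file and its content are placeholders):

```php
<?php
// Temporary file standing in for the remote page (hypothetical content).
$path = tempnam(sys_get_temp_dir(), 'demo');
file_put_contents($path, '<p>hello</p>');

// file_get_contents() returns the content as a string.
$asString = file_get_contents($path);

// readfile() writes the content to the output and returns the byte count;
// output buffering captures what it would otherwise have printed.
ob_start();
$byteCount = readfile($path);
$echoed = ob_get_clean();

var_dump($asString);   // string(12) "<p>hello</p>"
var_dump($byteCount);  // int(12)
unlink($path);
```

Passing the integer from readfile() to preg_match() or loadHTML() therefore gives the parser a number instead of the page, which would explain an empty result.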