I'll give you the gist.

I'm trying to scrape certain URLs using a third-party HTML tag stripper, because I don't think the built-in strip_tags() does the job well. (I don't think you need to look at that stripper itself.)

Sometimes the HTML source of a site contains malformed markup that causes my HTML tag stripper to fail.

One example is a site whose source contains the following line:

<li><a href="<//?=$cnf['website']?>girls/models-photo-gallery/?sType=6#top_menu">Photo Galleries</a></li>

which causes the tag stripper to throw this error:

Parse error: syntax error, unexpected T_ENCAPSED_AND_WHITESPACE, expecting T_STRING or T_VARIABLE or T_NUM_STRING in /var/www/GET Tweets/htdocs/tmhOAuth-master/examples/class.html2text.inc(429) : regexp code on line 1

Fatal error: preg_replace() [<a href='function.preg-replace'>function.preg-replace</a>]: Failed evaluating code: $this-&gt;_build_link_list(&quot;&lt;//?=$cnf[\'website\']?&gt;girls/models-photo-gallery/?sType=6#top_menu&quot;, &quot;Photo Galleries&quot;) in /var/www/GET Tweets/htdocs/tmhOAuth-master/examples/class.html2text.inc on line 429

I have an array of many URLs and do some processing on each one. Some of them trigger the error above.

If a URL in the array triggers an error like this, I want execution to continue with the next URL without anything else being affected. My code is something like this:

foreach ($results as $result)
{
    $url=$result->Url;

    $worddict2=myfunc($url,$worddict2,$history,$n_gram);        
}

Here myfunc does the processing and uses the 3rd party HTML stripper I mentioned before. I tried modifying the code to this:

foreach ($results as $result)
    {
        $url=$result->Url;
        $worddicttemp=array();
        try
        {
            $worddicttemp=myfunc($url,$worddict2,$history,$n_gram); //returns the string representation of what matters, hopefully
            //The below line will be executed only when the above function doesn't throw a fatal error
            $worddict2=$worddicttemp;
        }
        catch(Exception $e)
        {
            continue;
        }
    }

But I'm still getting the same error. What is wrong? Why is the code inside myfunc() not transferring control to the catch block as soon as it hits that fatal error?

Programming Noob
  • use strstr to check if you have error in $worddicttemp if true then use continue to go to next url – Sohail Ahmed Nov 26 '12 at 07:55
  • A HTML stripper that uses the preg_replace 'e' modifier is pretty wild. I would look for some other solution as the functionality in question is going the way of the dodo. – cleong Nov 26 '12 at 08:27
  • I believe the eval modifier for preg_* is being removed sooner or later, might be better to get rid now. – Dale Nov 26 '12 at 08:35
  • Yes it is due to be deprecated see [here](http://www.php.net/manual/en/reference.pcre.pattern.modifiers.php) – Dale Nov 26 '12 at 08:37
  • Using it in this context is insane in any event. – cleong Nov 26 '12 at 08:40
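
The error message fits the comments above: with the 'e' modifier the replacement is evaluated as PHP code, and the matched href ends up inside a double-quoted string, where `$cnf[\'website\']` is invalid interpolation syntax. A minimal sketch of the same replacement done with preg_replace_callback(), which evaluates nothing, follows. The pattern and the `buildLinkList()` helper are hypothetical stand-ins, not the actual code of class.html2text.inc.

```php
<?php
// Hypothetical stand-in for html2text's _build_link_list(); the real
// method's behaviour is not shown in the question.
function buildLinkList($href, $text)
{
    return $text . ' [' . $href . ']';
}

// The problematic line from the question, kept as a plain string.
$html = '<li><a href="<//?=$cnf[\'website\']?>girls/models-photo-gallery/?sType=6#top_menu">Photo Galleries</a></li>';

// Assumed pattern: capture the href value and the link text.
$pattern = '/<a [^>]*href="([^"]+)"[^>]*>(.*?)<\/a>/i';

// The callback receives the matches as ordinary strings, so nothing in
// the scraped HTML is ever parsed as PHP code.
$result = preg_replace_callback(
    $pattern,
    function ($m) {
        return buildLinkList($m[1], $m[2]);
    },
    $html
);

echo $result, "\n";
```

Because the matched text is passed as data rather than pasted into code, stray `$` signs or quotes in a scraped page cannot cause a parse error here.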

2 Answers

0

I suggest running the HTML through a cleanup tool such as Tidy before parsing it. Your problem can also be solved by adding

$html_content = htmlspecialchars($html_content);
Dmitriy Sushko
-1

You can't catch parse errors with try/catch (or any fatal error, for that matter, but parse errors are even worse, since they are raised as soon as the code is loaded). The best way I know of to isolate them is to run a completely independent PHP process for each piece of work that you expect might raise a fatal error and that you want to recover from.
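
A minimal sketch of that idea, assuming the PHP CLI is available and exec() is not disabled. `runIsolated()` and the `worker.php` script named in the comments are hypothetical, not something from the question.

```php
<?php
/**
 * Run a PHP script in a separate process and return its output,
 * or null if the child exited with a non-zero status (the PHP CLI
 * exits with status 255 after a fatal error).
 *
 * Hypothetical helper, shown only to illustrate process isolation.
 */
function runIsolated($script, array $args = array())
{
    // PHP_BINARY is the path of the running PHP executable (PHP 5.4+).
    $cmd = escapeshellarg(PHP_BINARY) . ' ' . escapeshellarg($script);
    foreach ($args as $arg) {
        $cmd .= ' ' . escapeshellarg($arg);
    }

    $outputLines = array();
    $exitCode = 0;
    exec($cmd, $outputLines, $exitCode);

    return $exitCode === 0 ? implode("\n", $outputLines) : null;
}

// Possible usage, assuming a hypothetical worker.php that calls myfunc()
// for the URL in $argv[1] and prints the result:
//
// foreach ($results as $result) {
//     $output = runIsolated('worker.php', array($result->Url));
//     if ($output === null) {
//         continue; // the child died; move on to the next URL
//     }
//     // ... merge $output into $worddict2 ...
// }
```

A fatal error then only kills the child process, and the parent loop sees a null result and carries on. The cost is one process start per URL, plus having to pass state such as $worddict2 in and out as text.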

See also How do I catch a PHP Fatal Error
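
For readers on PHP 7 or later, the picture is somewhat different: many errors that used to be fatal are thrown as Error objects, and a parse error inside eval() is thrown as ParseError, so catching Throwable lets a loop continue. This does not rescue the 'e' modifier itself, which PHP 7 removed. The sketch below uses a hypothetical `processSnippet()` in place of myfunc() purely to trigger the same kind of parse error.

```php
<?php
// Hypothetical stand-in for myfunc(): evaluates a code string, much as
// the 'e' modifier used to, so a malformed string raises ParseError
// on PHP 7+.
function processSnippet($code)
{
    return eval($code);
}

$snippets = array(
    'return "ok";',
    'return "$cnf[\'website\']";', // same invalid interpolation as in the question
    'return "also ok";',
);

$results = array();
foreach ($snippets as $snippet) {
    try {
        $results[] = processSnippet($snippet);
    } catch (Throwable $e) {
        // Throwable covers both Exception and Error (including ParseError).
        continue;
    }
}

print_r($results);
```

Catching Exception alone, as in the question, would still miss these, because Error and its subclasses do not extend Exception.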

FoolishSeth