I am scraping URLs from a webpage, and they display fine on the page, but when I insert a URL into the database it is stored in a weird form, like this:

http://westseattleblog.com/event/west-seattle-church-listings/?instance_id=567059

My code:

foreach($html->find('div[class=ai1ec-btn-group ai1ec-actions] a') as $element)
{
    $url= $element->href;
    $url1=mysql_real_escape_string($url);
    $sql="insert into catlink(catlink) values('$url1')";
    //echo $sql."<br>";
    $query=mysql_query($sql);
    //newpage
} 

And when I fetch the URLs from the database and scrape them one by one, nothing is shown.

My code:

$sql1="select * from links limit 10";
$query1=mysql_query($sql1);
while($res=mysql_fetch_assoc($query1)){
    $url=$res['url'];

    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_POST, 1);
    curl_setopt($ch, CURLOPT_HEADER, 0);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
    // curl_setopt($ch, CURLOPT_COOKIEJAR, "cookies.txt");
    // curl_setopt($ch, CURLOPT_COOKIEFILE, "cookies.txt");
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.3) Gecko/20070309 Firefox/2.0.0.3");
    curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
    $page = curl_exec($ch);
    curl_close($ch);
    $dom = new simple_html_dom();
    $html = $dom->load($page);
    foreach($html->find("div") as $a){
        echo $a->innertext;
    }
    //$separator = '&nbsp;-&nbsp;';
}
CyberSoul
  • Do you mean you get nothing at all from your query to the DB? In regards to the garbled content, it should be fine. Check [this article](http://stackoverflow.com/questions/7867204/how-should-be-kept-as-html-tags-in-database). – Paulo Hgo Feb 26 '17 at 18:47
  • Base64 encode the URL to a safe Base64 string then save that instead to your database. You can easily get back your original URL when you Base64 decode the saved string from your database. See http://stackoverflow.com/questions/13109588/base64-encoding-in-java – Joseph Feb 26 '17 at 21:49

1 Answer

Your URL appears to contain HTML entities (encoded characters such as numeric or hex character references), so you need to decode it with html_entity_decode before you insert it into your database, or before using it with cURL.

So:

$url1=mysql_real_escape_string(html_entity_decode($url));

or

$url=html_entity_decode($res['url']);
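
To see what the decoding step does in isolation, here is a minimal sketch. The sample URL is hypothetical (it is not the one from the question), and it assumes the scraped href contains `&` encoded as the numeric entity `&#038;`, which is one common way such URLs end up looking odd when stored:

```php
<?php
// Hypothetical scraped href: the "&" separator arrived as the entity "&#038;".
$scraped = 'http://example.com/event/?instance_id=1&#038;page=2';

// Decode named and numeric HTML entities back to literal characters.
// The flags and charset are passed explicitly so the result does not
// depend on the PHP version's defaults.
$decoded = html_entity_decode($scraped, ENT_QUOTES | ENT_HTML5, 'UTF-8');

echo $decoded, "\n"; // http://example.com/event/?instance_id=1&page=2
```

Whether you decode before the insert or after the select is a design choice; decoding once before storing keeps the database values directly usable by cURL.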
Fabien TheSolution