
I am still stuck on a screen scraping problem (link: screen scraping in php problem).

That problem was partly solved by adding '&num=100' to the Google search query, which cut the number of requests by a factor of 10. But the captcha problem is still there. To work around it I used the sleep(seconds) function.

Now the problem is that I have to do the scraping myself (those are my orders). That means I don't want to use 'simple_html_dom.php', because catching warnings and errors is difficult (for me) in that case. I have been instructed to do it myself, so how can I do it? I know two methods: 1. file_get_contents() 2. curl.

But it is very tedious to fetch the page, search for your content, and count the rank at the same time, because using regular expressions to parse the DOM is hell. Read this link to convince yourself: RegEx match open tags except XHTML self-contained tags

Tasks to be implemented:

  1. Catch the captcha error (or warning) so I can stop further execution.
  2. Send headers, so the request looks like a genuine, valid, human request to Google.

simple_html_dom.php can't catch errors; it only shows a warning when the captcha error occurs. How can I catch that warning? Please help, I have been working on this module for a long time. Please give suggestions for each of the problems described here.
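For task 1, one general PHP technique (not specific to simple_html_dom.php or to Google) is to convert warnings into exceptions with set_error_handler() and ErrorException, so that an ordinary try/catch can stop execution. The sketch below is only an illustration under that assumption; fetchPage() is a hypothetical helper name, and whether the captcha case actually surfaces as a PHP warning in a given setup is not confirmed here.

```php
<?php
// Convert PHP warnings and notices into ErrorException so try/catch can see them.
set_error_handler(function ($severity, $message, $file, $line) {
    // Skip errors that the current error_reporting level (or the @ operator) masks.
    if (!(error_reporting() & $severity)) {
        return false;
    }
    throw new ErrorException($message, 0, $severity, $file, $line);
});

// Hypothetical helper: returns the page body, or null if the fetch raised a warning.
function fetchPage($url)
{
    try {
        return file_get_contents($url);
    } catch (ErrorException $e) {
        // A warning (e.g. "failed to open stream") ends up here instead of on screen.
        return null;
    }
}
```

A caller could then check for null and stop further requests, for example `if (fetchPage($url) === null) { exit; }`.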

Aakash Sahai

1 Answer


Don't know about the first problem (captcha), but you can send headers easily with curl, for example:

// Create a cURL handle and attach extra request headers
$ch = curl_init();
curl_setopt($ch, CURLOPT_HTTPHEADER, array('Accept-Charset: utf-8'));

And to set the user agent:

curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (X11; Linux x86_64; rv:2.2a1pre) Gecko/20110324 Firefox/4.2a1pre');
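Putting the two options together, a complete request might look roughly like the sketch below. The helper names fetchWithHeaders() and looksBlocked() are made up for illustration, the extra Accept-Language header is just an example, and treating HTTP 429/503 or the word "captcha" in the body as a sign of blocking is an assumption, not documented Google behaviour.

```php
<?php
// Hypothetical helper: guess whether a response is a block/captcha page.
// The status codes and the "captcha" marker are assumptions; a results page
// that merely mentions captchas would be a false positive.
function looksBlocked($status, $body)
{
    if ($status === 429 || $status === 503) {
        return true;
    }
    return stripos($body, 'captcha') !== false;
}

// Hypothetical helper: fetch a URL with browser-like headers via cURL.
function fetchWithHeaders($url, $userAgent)
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);  // return the body instead of printing it
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);  // follow redirects
    curl_setopt($ch, CURLOPT_HTTPHEADER, array(
        'Accept-Charset: utf-8',
        'Accept-Language: en-US,en;q=0.8',
    ));
    curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);

    $body = curl_exec($ch);
    $status = (int) curl_getinfo($ch, CURLINFO_HTTP_CODE);

    if ($body === false) {
        // Transport-level failure (DNS, timeout, ...): no HTTP response at all.
        return array('status' => 0, 'body' => '', 'blocked' => false);
    }
    return array(
        'status'  => $status,
        'body'    => $body,
        'blocked' => looksBlocked($status, $body),
    );
}
```

The caller can stop the loop as soon as 'blocked' is true, instead of continuing to send requests.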
Gnuffo1