
I have about 66 million domains in a MySQL table. I need to run a crawler on all of the domains and update the row's count column to 1 when the crawler completes.

The crawler script is written in PHP using a PHP crawler library. Here is the script:

set_time_limit(10000);
try {
    // URL and row id are posted to this controller action
    $strWebURL = $_POST['url'];

    $crawler = new MyCrawler();
    $crawler->setURL($strWebURL);
    $crawler->addContentTypeReceiveRule("#text/html#");
    $crawler->addURLFilterRule("#\.(jpg|jpeg|gif|png)$# i"); // skip images
    $crawler->enableCookieHandling(true);
    $crawler->setTrafficLimit(1000 * 1024); // stop after ~1 MB of traffic
    $crawler->setConnectionTimeout(10);

    // start of the results table
    echo '<table border="1" style="margin-bottom:10px;width:100% !important;">';
    echo '<tr>';
    echo '<th>URL</th>';
    echo '<th>Status</th>';
    echo '<th>Size (bytes)</th>';
    echo '<th>Page</th>';
    echo '</tr>';
    $crawler->go();
    echo '</table>';

    // mark this domain as crawled
    $this->load->model('urls');
    $this->urls->incrementCount($_POST['id'], 'urls');

} catch (Exception $e) {
    // note: exceptions are silently swallowed here
}

$this->urls->incrementCount() just updates the row, marking the count column = 1.
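
For reference, a minimal sketch of what that update amounts to on the MySQL side, assuming the urls table has id and count columns (the function name here is made up; the actual model code isn't shown in the question):

import MySQLdb

def mark_crawled(db, row_id):
    # What incrementCount() boils down to: flag the row as crawled
    cur = db.cursor()
    cur.execute("UPDATE urls SET count = 1 WHERE id = %s", (row_id,))
    db.commit()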

Because I have 66M domains, I needed to run this as a cron job on my server. Since cron jobs run on the command line, I needed a headless browser, so I chose PhantomJS; the crawler doesn't work the way I want it to without the headless browser (PhantomJS).

The first problem I faced was loading the domains from the MySQL DB and running the crawler script from a JS script. I tried the following:

  • Create a PHP script that returns the domains as JSON, load it from the JS file, and run the crawler in a loop over the domains. This didn't work very well and got stuck after some time.
  • The next thing I tried, which I'm still using, is a Python script that loads the domains directly from the MySQL DB and runs the PhantomJS script on each domain.

Here is the code:

import MySQLdb
import sys
import subprocess

args = sys.argv

db = MySQLdb.connect("HOST", "USER", "PW", "DB")
cursor = db.cursor()
frm = args[1]    # OFFSET into the result set
limit = args[2]  # number of rows to fetch

try:
    # Parameterized query (the driver escapes the values) instead of
    # string interpolation, which was open to SQL injection
    cursor.execute("SELECT * FROM urls WHERE count = 0 LIMIT %s, %s",
                   (int(frm), int(limit)))
    print "TOTAL RECORDS: " + str(cursor.rowcount)
    results = cursor.fetchall()
    for row in results:
        try:
            domain = row[1].lower()
            idd = row[0]
            command = "/home/wasif/public_html/phantomjs /home/wasif/public_html/crawler2.js %s %s" % (domain, idd)
            print command
            # Launch PhantomJS and block until it exits (serial execution)
            proc = subprocess.Popen(command, shell=True, stdout=subprocess.PIPE)
            script_response = proc.stdout.read()
            print script_response
        except Exception:
            print "error running crawler: " + domain

except Exception:
    print "Error: unable to fetch data"
db.close()

It takes two arguments that set the OFFSET and LIMIT of the SELECT, e.g. python script.py 0 1000 processes the first 1,000 uncrawled domains; a caveat about this paging scheme follows below.
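
One caveat (an editorial note, not from the original post): since crawled rows flip to count = 1 and drop out of WHERE count = 0, the result set shrinks between runs and OFFSET-based paging silently skips rows. Keyset paging by id avoids this; a sketch, with the domain column name assumed:

# Fetch the next batch after the highest id already processed,
# instead of paging with a shifting OFFSET
cursor.execute(
    "SELECT id, domain FROM urls WHERE count = 0 AND id > %s "
    "ORDER BY id LIMIT %s",
    (int(last_id), int(batch_size)))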

It loops over the domains and runs this command using subprocess:

command = "/home/wasif/public_html/phantomjs /home/wasif/public_html/crawler2.js %s %s" % (domain, idd)
proc = subprocess.Popen(command, shell=True, stdout=subprocess.PIPE)
script_response = proc.stdout.read()
print script_response
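
Note that proc.stdout.read() blocks until PhantomJS exits, so the domains are crawled strictly one at a time, and a single hung PhantomJS instance stalls the whole loop. A minimal sketch of a guard against that, assuming Python 2 (whose subprocess module has no timeout argument); the helper name is made up:

import subprocess
import time

def run_with_timeout(command, timeout=60):
    # Poll the child process and kill it if it runs too long
    proc = subprocess.Popen(command, shell=True, stdout=subprocess.PIPE)
    deadline = time.time() + timeout
    while proc.poll() is None:
        if time.time() > deadline:
            proc.kill()  # stop a hung PhantomJS instance
            return None
        time.sleep(0.5)
    return proc.stdout.read()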

crawler2.js also takes two arguments: the first is the domain and the second is the id used to update count = 1 when the crawler completes. (In PhantomJS, system.args[0] is the script name, which is why the real arguments start at indexes 1 and 2.) This is crawler2.js:

var args = require('system').args;
var address = '';
var id = '';

// args[0] is the script name; the domain and row id follow
args.forEach(function (arg, i) {
    if (i == 1) {
        address = arg;
    }
    if (i == 2) {
        id = arg;
    }
});

address = "http://www." + address;

var page = require('webpage').create(),
    server = 'http://www.EXAMPLE.net/main/crawler',
    data = 'url=' + address + '&id=' + id;

console.log(data);

// POST the url and id to the PHP crawler controller shown above
page.open(server, 'post', data, function (status) {
    if (status !== 'success') {
        console.log(address + ' Unable to post!');
    } else {
        console.log(address + ' : done');
    }
    phantom.exit();
});

It works well, but my script gets stuck after some time and needs to be restarted, and the log shows nothing wrong.

I need to optimize this process and run the crawler as fast as I can. Any help would be appreciated.

Wasif Khalil

1 Answer


Web crawler programmer here. :)

Your Python executes PhantomJS serially. You should do it in parallel: execute PhantomJS, then leave it running, don't wait for it.

In PHP, it would look like this (the trailing & backgrounds the process, and > /dev/null discards its output so PHP doesn't wait for it):

exec("/your_executable_path > /dev/null &");

Don't use PhantomJS if you don't need to. It renders everything, and each instance will need more than 50 MB of memory.

Iwanio
  • Without PhantomJS my crawler wasn't running as I wanted it to. Can you post an example of how to run it in parallel? – Wasif Khalil Mar 01 '15 at 09:14
  • Check the sample code in my answer. But don't do 50M crawlers in parallel; do it 10 or 20 at a time. It depends on your server infrastructure. – Iwanio Mar 01 '15 at 09:26
  • You have said, "Without phantom my crawler wasnt running as i wanted it". What do you want PhantomJS to do for you? I use PhantomJS only for screen grabs, PDF rendering, or running heavily JavaScripted websites that can't be scraped using curl. – Iwanio Mar 01 '15 at 09:28
  • I need to use it to make it look like the sites are opened using an actual browser, not on the command line. – Wasif Khalil Mar 01 '15 at 09:30
  • Do you want me to replace subprocess with exec in Python? – Wasif Khalil Mar 01 '15 at 09:32
  • Your problem here is that you wait for the child process's output. You must leave the process running, and when it finishes, the process saves its output itself. Sorry, I'm not a Python programmer. – Iwanio Mar 01 '15 at 09:36
  • Yeah, I think you're right, but I don't know how to do that: how to leave the process running without waiting for output. – Wasif Khalil Mar 01 '15 at 09:51
  • @WasifKhalil nohup with `&` is your friend on the command line, or use Python threads. – Artjom B. Mar 01 '15 at 10:07
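
To tie the comments together, a minimal sketch of the bounded-parallelism approach (10 to 20 at a time, per Iwanio's advice) using a thread pool; this is an editorial example, not code from either poster:

from multiprocessing.dummy import Pool  # thread-backed Pool
import subprocess

def crawl(row):
    idd, domain = row
    command = "/home/wasif/public_html/phantomjs /home/wasif/public_html/crawler2.js %s %s" % (domain, idd)
    # Each worker blocks on its own PhantomJS, but 20 run concurrently
    proc = subprocess.Popen(command, shell=True, stdout=subprocess.PIPE)
    return proc.stdout.read()

rows = [(1, "example.com"), (2, "example.org")]  # would come from the MySQL SELECT
pool = Pool(20)  # 10-20 workers, per the advice above
for output in pool.imap_unordered(crawl, rows):
    print output
pool.close()
pool.join()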