I have about 66 million domains in a MySQL table. I need to run a crawler on every domain and set the row's count = 1 once the crawler has finished.
The crawler script is written in PHP using a PHP crawler library. Here is the script:
set_time_limit(10000);
try{
    $strWebURL = $_POST['url'];
    $crawler = new MyCrawler();
    $crawler->setURL($strWebURL);
    $crawler->addContentTypeReceiveRule("#text/html#");
    $crawler->addURLFilterRule("#\.(jpg|jpeg|gif|png)$# i");
    $crawler->enableCookieHandling(true);
    $crawler->setTrafficLimit(1000 * 1024);
    $crawler->setConnectionTimeout(10);
    //start of the table
    echo '<table border="1" style="margin-bottom:10px;width:100% !important;">';
    echo '<tr>';
    echo '<th>URL</th>';
    echo '<th>Status</th>';
    echo '<th>Size (bytes)</th>';
    echo '<th>Page</th>';
    echo '</tr>';
    $crawler->go();
    echo '</table>';
    $this->load->model('urls');
    $this->urls->incrementCount($_POST['id'],'urls');
}catch(Exception $e){
}
$this->urls->incrementCount() only updates the row, setting its count column to 1.
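For readers who want to see what that update amounts to, here is a minimal sketch in Python 3. The real incrementCount() model method is not shown in this post, so the SQL below is an assumption, as is the `domain` column name. An in-memory SQLite table stands in for the MySQL `urls` table so the snippet runs on its own.

```python
import sqlite3

def mark_crawled(conn, row_id):
    """Set count = 1 for one row; return how many rows were changed."""
    # Assumed SQL: the post only says the method marks count = 1 for the row.
    cur = conn.execute("UPDATE urls SET count = 1 WHERE id = ?", (row_id,))
    conn.commit()
    return cur.rowcount

# In-memory SQLite table standing in for the MySQL `urls` table (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE urls (id INTEGER PRIMARY KEY, domain TEXT, count INTEGER DEFAULT 0)"
)
conn.executemany(
    "INSERT INTO urls (domain) VALUES (?)", [("example.com",), ("example.org",)]
)

mark_crawled(conn, 1)
print(conn.execute("SELECT id, count FROM urls ORDER BY id").fetchall())
```

Passing the id as a bound parameter rather than formatting it into the SQL string keeps the statement safe even if the id comes straight from a POST field.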
Because I have 66M domains, I need to run this as a cron job on my server. Since a cron job runs on the command line, I needed a headless browser, so I chose PhantomJS; the crawler doesn't work the way I want without a headless browser.
The first problem I faced was loading the domains from the MySQL database and running the crawler script from a JS script. I tried the following:
- Create a PHP script that returns the domains as JSON, load it from the JS file, loop over the domains, and run the crawler. This didn't work very well and got stuck after some time.
- Next, and this is what I'm still using: create a Python script that loads the domains directly from the MySQL database and runs the PhantomJS script on each domain.
Here is the code:
import MySQLdb
import httplib
import sys
import subprocess
import json
args = sys.argv;
db = MySQLdb.connect("HOST","USER","PW","DB")
cursor = db.cursor()
#tablecount = args[1]
frm = args[1]
limit = args[2]
try:
    sql = "SELECT * FROM urls WHERE count = 0 LIMIT %s,%s" % (frm,limit)
    cursor.execute(sql)
    print "TOTAL RECORDS: "+str(cursor.rowcount)
    results = cursor.fetchall()
    count = 0;
    for row in results:
        try:
            domain = row[1].lower()
            idd = row[0]
            command = "/home/wasif/public_html/phantomjs /home/wasif/public_html/crawler2.js %s %s" % (domain,idd)
            print command
            proc = subprocess.Popen(command, shell=True, stdout=subprocess.PIPE)
            script_response = proc.stdout.read()
            print script_response
        except:
            print "error running crawler: "+domain
except:
    print "Error: unable to fetch data"
db.close()
The Python script takes two arguments, the offset and the row count for the LIMIT clause, which select the range of domains to read from the database.
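One thing worth checking with this offset-based selection: the query filters on count = 0, and rows leave that set as they are crawled, so a fixed offset may point at different rows if the script is rerun or if several copies run with different offsets. A possible alternative is to page by id instead (keyset pagination). The sketch below is in Python 3 and uses an in-memory SQLite table as a stand-in for MySQL; the `domain` column name and the `fetch_batch` helper are assumptions for illustration.

```python
import sqlite3

def fetch_batch(conn, last_id, batch_size):
    """Return up to batch_size uncrawled (id, domain) rows with id > last_id."""
    return conn.execute(
        "SELECT id, domain FROM urls WHERE count = 0 AND id > ? ORDER BY id LIMIT ?",
        (last_id, batch_size),
    ).fetchall()

# Stand-in table with seven uncrawled rows (hypothetical schema and data).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE urls (id INTEGER PRIMARY KEY, domain TEXT, count INTEGER DEFAULT 0)"
)
conn.executemany(
    "INSERT INTO urls (domain) VALUES (?)",
    [("d%d.example" % i,) for i in range(1, 8)],
)

last_id = 0
seen = []
while True:
    batch = fetch_batch(conn, last_id, 3)
    if not batch:
        break
    for row_id, domain in batch:
        seen.append(row_id)  # the crawl would happen here
        conn.execute("UPDATE urls SET count = 1 WHERE id = ?", (row_id,))
    conn.commit()
    last_id = batch[-1][0]  # resume after the last id seen, not at a row offset

print(seen)
```

Because each batch starts after the last id already handled, marking rows as crawled does not shift the position of the next batch, and only one small batch is held in memory at a time.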
The script then loops over the domains and runs this command for each one using subprocess:
command = "/home/wasif/public_html/phantomjs /home/wasif/public_html/crawler2.js %s %s" % (domain,idd)
print command
proc = subprocess.Popen(command, shell=True, stdout=subprocess.PIPE)
script_response = proc.stdout.read()
print script_response
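Note that proc.stdout.read() blocks until the child process closes its output, so if one PhantomJS run never exits, the whole loop waits on it with nothing written to the log. That may or may not be the cause of the stall here. Below is a sketch, in Python 3, of running a command with a timeout and without a shell. The `run_with_timeout` helper is hypothetical, and the commands in the demo are harmless stand-ins for the PhantomJS call.

```python
import subprocess
import sys

def run_with_timeout(argv, timeout_seconds):
    """Run argv (a list, no shell). Return (exit_code, output), or (None, '') on timeout."""
    try:
        done = subprocess.run(
            argv,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            timeout=timeout_seconds,
            universal_newlines=True,
        )
        return done.returncode, done.stdout
    except subprocess.TimeoutExpired:
        # subprocess.run kills and reaps the child before raising this exception.
        return None, ""

# Stand-in commands. In the script above, argv would be something like
# [phantomjs_path, crawler2_js_path, domain, str(idd)].
print(run_with_timeout([sys.executable, "-c", "print('done')"], 10))
print(run_with_timeout([sys.executable, "-c", "import time; time.sleep(30)"], 1))
```

Passing the arguments as a list also avoids shell=True, so a domain value containing shell metacharacters cannot change the command that gets executed.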
The crawler2.js file also takes two arguments: the first is the domain and the second is the id of the row whose count is set to 1 when the crawler completes. This is crawler2.js:
var args = require('system').args;
var address = '';
var id = '';
args.forEach(function(arg, i) {
    if(i == 1){
        address = arg;
    }
    if(i == 2){
        id = arg;
    }
});
address = "http://www."+address;
var page = require('webpage').create(),
    server = 'http://www.EXAMPLE.net/main/crawler',
    data = 'url='+address+'&id='+id;
console.log(data);
page.open(server, 'post', data, function (status) {
    if (status !== 'success') {
        console.log(address+' Unable to post!');
    } else {
        console.log(address+' : done');
    }
    phantom.exit();
});
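A side note on the POST body built by string concatenation: it is sent unencoded, which is fine for plain domain names but would break if a value ever contained characters such as & or =. The sketch below shows the encoded form in Python 3 using the standard library; in the JS file the corresponding standard function would be encodeURIComponent. The `build_post_body` helper and the sample URL are hypothetical.

```python
from urllib.parse import parse_qs, urlencode

def build_post_body(address, row_id):
    """Build an application/x-www-form-urlencoded body with both fields escaped."""
    return urlencode({"url": address, "id": row_id})

# A sample address containing characters that would otherwise split the body.
body = build_post_body("http://www.example.com/?a=1&b=2", 42)
print(body)

# Decoding the body gives back the original values, which shows nothing was lost.
print(parse_qs(body))
```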
This setup works, but the script gets stuck after some time and has to be restarted, and the log shows nothing wrong.
I need to optimize this process and run the crawler as fast as I can. Any help would be appreciated.
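For context on the speed question: the Python loop above handles one domain at a time, so total time is the sum of every crawl. One common way to overlap the waiting is a bounded worker pool. The following is only a sketch in Python 3 under stated assumptions: `crawl_all` and `fake_crawl` are hypothetical names, and `fake_crawl` stands in for whatever launches the crawl for one domain.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def crawl_all(jobs, crawl_one, max_workers=8):
    """Run crawl_one(domain, row_id) for each (row_id, domain) job, max_workers at a time."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {
            pool.submit(crawl_one, domain, row_id): row_id for row_id, domain in jobs
        }
        for future in as_completed(futures):
            row_id = futures[future]
            try:
                results[row_id] = future.result()
            except Exception as exc:
                # Record the failure so one bad domain does not stop the batch.
                results[row_id] = "error: %s" % exc
    return results

def fake_crawl(domain, row_id):
    """Stand-in for the real per-domain crawl."""
    if domain.startswith("bad"):
        raise ValueError("unreachable")
    return "done"

jobs = [(1, "example.com"), (2, "bad.example"), (3, "example.org")]
print(sorted(crawl_all(jobs, fake_crawl, max_workers=2).items()))
```

With tens of millions of rows, the jobs would need to be fed to the pool in batches rather than submitted all at once, and the worker count chosen to match what the server and the crawl endpoint can handle.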