I am trying to read a 12 MB+ file containing a large HTML table that looks like this:
<table>
<tr>
<td>a</td>
<td>b</td>
<td>c</td>
<td>d</td>
<td>e</td>
</tr>
<tr>
<td>a</td>
<td>b</td>
<td>c</td>
<td>d</td>
<td>e</td>
</tr>
<tr>..... up to 20,000+ rows....</tr>
</table>
Now this is how I'm scraping it:
<?php
require_once 'phpQuery-onefile.php';
$d = phpQuery::newDocumentFile('http://localhost/test.html');
$last_index = 20000;
for ($i = 1; $i <= $last_index; $i++)
{
    $set['c1'] = $d['tr:eq('.$i.') td:eq(0)']->text();
    $set['c2'] = $d['tr:eq('.$i.') td:eq(1)']->text();
    $set['c3'] = $d['tr:eq('.$i.') td:eq(2)']->text();
    $set['c4'] = $d['tr:eq('.$i.') td:eq(3)']->text();
    $set['c5'] = $d['tr:eq('.$i.') td:eq(4)']->text();
}
// code to insert to db here...
?>
My benchmark says it takes around 5.25 hours to scrape and insert 1,000 rows into the db. At that rate, it will take around 5 days just to finish all 20,000+ rows.
My local machine is running on:
- XAMPP on Windows 7
- CPU: Intel Core i3-2100 @ 3.1 GHz
- RAM: 4 GB G.Skill RipJaws X (dual channel)
- HDD: an old SATA drive
Is there any way I can speed up the process? Maybe I'm scraping it the wrong way? Note that the file is accessible locally, which is why I used http://localhost/test.html.
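One small tweak I have not benchmarked: since the file already sits on disk, the same newDocumentFile() call should presumably accept a plain file path, which would skip the HTTP round trip through Apache (the path here is assumed):

<?php
require_once 'phpQuery-onefile.php';
// Load the HTML straight from disk instead of over HTTP (hypothetical path).
$d = phpQuery::newDocumentFile('test.html');
?>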
Slightly faster solution:
for ($i = 1; $i <= $last_index; $i++)
{
    $r = $d['tr:eq('.$i.')'];
    $set['c1'] = $r['td:eq(0)']->text();
    $set['c2'] = $r['td:eq(1)']->text();
    $set['c3'] = $r['td:eq(2)']->text();
    $set['c4'] = $r['td:eq(3)']->text();
    $set['c5'] = $r['td:eq(4)']->text();
}
// code to insert to db here...
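Another idea I have not benchmarked yet: parse the table in a single pass with PHP's built-in DOM extension, so the document is only scanned once instead of re-running a tr:eq() selector for every row. A rough sketch, assuming the file can be read from disk as test.html and every data row has 5 cells:

<?php
// Single-pass sketch using PHP's DOM extension (file name is assumed).
$doc = new DOMDocument();
libxml_use_internal_errors(true);     // suppress warnings from imperfect HTML
$doc->loadHTMLFile('test.html');      // read straight from disk, no HTTP round trip
libxml_clear_errors();

$batch = array();
foreach ($doc->getElementsByTagName('tr') as $row)
{
    $cells = $row->getElementsByTagName('td');
    if ($cells->length < 5) {
        continue;                     // skip header or malformed rows
    }
    $batch[] = array(
        'c1' => trim($cells->item(0)->textContent),
        'c2' => trim($cells->item(1)->textContent),
        'c3' => trim($cells->item(2)->textContent),
        'c4' => trim($cells->item(3)->textContent),
        'c5' => trim($cells->item(4)->textContent),
    );
}
// code to insert $batch to db here, ideally as multi-row INSERTs
// inside a single transaction rather than one query per row...
?>

Collecting the rows first and then inserting them in batches (or at least inside one transaction) should also cut the per-row overhead on the db side, though that part is just a guess until I measure it.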