
I am trying to read a 12MB+ file containing a large HTML table that looks like this:

<table>
    <tr>
        <td>a</td>
        <td>b</td>
        <td>c</td>
        <td>d</td>
        <td>e</td>
    </tr>
    <tr>
        <td>a</td>
        <td>b</td>
        <td>c</td>
        <td>d</td>
        <td>e</td>
    </tr>
    <tr>..... up to 20,000+ rows....</tr>
</table>

Now this is how I'm scraping it:

<?php

require_once 'phpQuery-onefile.php';

$d = phpQuery::newDocumentFile('http://localhost/test.html');

$last_index = 20000;

for ($i = 1; $i <= $last_index; $i++)
{
    $set['c1']  = $d['tr:eq('.$i.') td:eq(0)']->text();
    $set['c2']  = $d['tr:eq('.$i.') td:eq(1)']->text();
    $set['c3']  = $d['tr:eq('.$i.') td:eq(2)']->text();
    $set['c4']  = $d['tr:eq('.$i.') td:eq(3)']->text();
    $set['c5']  = $d['tr:eq('.$i.') td:eq(4)']->text();
}

// code to insert to db here... 

?>

My benchmark says it takes around 5.25 hours to scrape 1,000 rows and insert them into the db. At that rate, it will take around 5 days just to finish the whole 20,000+ rows.

My local machine is running on:

  • XAMPP
  • Windows 7
  • CPU: Intel Core i3-2100 @ 3.1 GHz
  • RAM: G.Skill RipJaws X 4 GB (dual channel)
  • HDD: old SATA drive

Is there any way I can speed up the process? Maybe I'm scraping it the wrong way? Note that the file is accessible locally, hence I used http://localhost/test.html.

Slightly faster solution:

for ($i = 1; $i <= $last_index; $i++)
{
    $r = $d['tr:eq('.$i.')'];

    $set['c1']  = $r['td:eq(0)']->text();
    $set['c2']  = $r['td:eq(1)']->text();
    $set['c3']  = $r['td:eq(2)']->text();
    $set['c4']  = $r['td:eq(3)']->text();
    $set['c5']  = $r['td:eq(4)']->text();
}

// code to insert to db here... 

IMB
  • You should be using a readymade table-extracting library, not collecting the data yourself. (For example http://blog.mspace.fm/2009/10/14/parse-an-html-table-with-php/ - though you have to check whether that regex is sufficiently robust for your case.) – mario Nov 10 '11 at 18:48
  • @mario Isn't phpQuery already a readymade library? – IMB Nov 10 '11 at 18:56

1 Answer


I have never worked with phpQuery, but that looks like a very sub-optimal way to parse a huge document: it's possible that phpQuery has to walk through the whole thing every time you make it load a row using `tr:eq('.$i.')`.

The much more straightforward (and probably also much faster) way would be to simply walk through each tr element of the document, and deal with each element's children in a foreach loop. You wouldn't even need phpQuery for that.

See How to Parse XML File in PHP for a variety of solutions.
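If phpQuery turns out to be the bottleneck, a minimal sketch of that plain-DOM approach could look like the following; it assumes the table is available as a local file named test.html and that every data row has exactly 5 cells, as in the question:

$doc = new DOMDocument();
libxml_use_internal_errors(true);      // tolerate not-quite-clean HTML
$doc->loadHTMLFile('test.html');       // read the local file directly, no HTTP round trip
libxml_clear_errors();

$rows = array();
foreach ($doc->getElementsByTagName('tr') as $tr) {    // walk each row exactly once
    $set = array();
    $i = 0;
    foreach ($tr->getElementsByTagName('td') as $td) { // then each cell of that row
        $i++;
        $set['c' . $i] = trim($td->textContent);
    }
    if ($i === 5) {                    // skip header or malformed rows
        $rows[] = $set;
        // ...or insert $set into the db right here
    }
}

The document is parsed once up front; each iteration then only touches the nodes of a single row instead of re-running a selector against the full 20,000-row table.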

Pekka
  • @IMB if the HTML is clean, it won't matter. However, you can try sticking with phpQuery first, you just have to change your approach: make phpQuery load all the `tr`s in one go (i.e. all of the `table`'s children named `tr`...), then walk through them. That might already be faster by orders of magnitude – Pekka Nov 10 '11 at 18:33
  • The HTML isn't really clean. I kinda get what you're saying, but I don't know how to do that in code other than what I did above. How do I load all the TRs without walking through each one? I would have to do a double foreach to get each TD, right? That sounds even slower to me, though. – IMB Nov 10 '11 at 18:40
  • @IMB believe me, it's probably going to be much faster. The way you are doing it right now, it may have to *parse the whole document* on every iteration. I don't know how phpQuery is implemented but this is almost certainly the most sub-optimal way to do it. – Pekka Nov 10 '11 at 18:43
  • You're right, thanks. I rethought what you said and realized I only needed one foreach. I edited the code above and it is now 20% faster. Although I think it can be even faster LOL, what do you think? – IMB Nov 10 '11 at 19:04
  • @IMB mmm, that's still not what I mean. I mean load all the children at once. Hang on, I'll look for an example – Pekka Nov 10 '11 at 19:06
  • @IMB I can't find many live examples of phpQuery but this is what you want to be looking at: http://code.google.com/p/phpquery/wiki/Traversing especially `children()`. Load *all the `tr`s at once*, get them back from phpQuery, then walk through them using `foreach` – Pekka Nov 10 '11 at 19:10
  • I was testing how to use `children()`, but unfortunately I have no idea how to use it in this case. Doing `$string = $d['#table_id'];` lets me capture the entire table, containing all the rows, in a string; is that what you mean? – IMB Nov 10 '11 at 19:53
  • @IMB maybe phpQuery is not the right tool here. See here for a simple example of how to traverse an HTML structure using DOM: [PHP code to traverse through a HTML file to find all the images in there?](http://stackoverflow.com/q/3630348) – Pekka Nov 10 '11 at 19:55
  • Actually, I was using simplehtmldom before phpQuery, but I find phpQuery easier to use; I'm not sure about the speed, though. I guess I'll try simplehtmldom again. Anyway, with regard to accessing each row, the only way I know is the first solution above. I have no idea how I can use `children()` for that matter; I hope someone who has used phpQuery can see this. – IMB Nov 10 '11 at 20:15
  • @IMB the idea is simple - 1. Get all the children from the XML parser (instead of calling the XML parser in every iteration). 2. Walk through them using `foreach ($children as $child)`... It *must* be possible with phpQuery but it's really scarce on examples – Pekka Nov 10 '11 at 20:17
  • BTW, the solution above is actually 80% faster, not 20%, my bad. I guess that's fast enough, although I'm still curious how fast `children()` would be if it were indeed possible in phpQuery. – IMB Nov 10 '11 at 20:45
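For reference, a rough sketch of the "select all the `tr`s once, then foreach" idea discussed in the comments, assuming phpQuery's pq() helper behaves like its jQuery counterpart (iterating a result set yields plain DOM nodes that can be re-wrapped with pq()); the selector runs a single time instead of once per row:

require_once 'phpQuery-onefile.php';

$d = phpQuery::newDocumentFile('http://localhost/test.html');

$rows = array();
foreach ($d['tr'] as $tr) {                 // one selector call for all rows; yields DOM nodes
    $set = array();
    $i = 0;
    foreach (pq($tr)->find('td') as $td) {  // pq() re-wraps a node, jQuery-style
        $i++;
        $set['c' . $i] = pq($td)->text();
    }
    $rows[] = $set;
    // insert $set into the db here
}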