0

Recently I've been trying to extract listing information from a given HTML file.

For example, I have an HTML page that lists many companies, each with its phone number, address, etc.

Each company is in its own table, and every table starts like this: <table border="0">

I tried to use PHP to extract all of the information so I could use it later, for example by writing it to a text file or importing it into a database.

I assume the way to achieve this is with regex, which is one of the things I really struggle with in PHP.

I would appreciate it if you could help me here. (I only need to know what to look for, or at least something that points me in the right direction, not complete code.)

Thanks in advance!!

Brad Christie
Dan

3 Answers

5

I recommend taking a look at PHP's DOMDocument and parsing the file with an actual HTML parser, not regex.

There are some very straightforward ways of getting the tables, such as the getElementsByTagName method.


<?php

  // placeholder markup: put your HTML code here
  $htmlCode = '<table border="0">...</table>';

  // create a new HTML parser
  // http://php.net/manual/en/class.domdocument.php
  $dom = new DOMDocument();

  // Load the HTML into the parser
  // http://www.php.net/manual/en/domdocument.loadhtml.php
  $dom->loadHTML($htmlCode);

  // Locate all the tables within the document
  // http://www.php.net/manual/en/domdocument.getelementsbytagname.php
  $tables = $dom->getElementsByTagName('table');

  // iterate over all the tables
  foreach ($tables as $table)
  {
    // you can now work with $table and find children within, check for
    // specific classes applied--look for anything that would flag this
    // as the type of table you'd like to parse and work with--then begin
    // grabbing information from within it and treating it as a DOMElement
    // http://www.php.net/manual/en/class.domelement.php
  }
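
To make the loop body above more concrete, here is a minimal sketch of what "grabbing information" might look like. The sample markup, the company names, and the assumption that each value sits in its own `<td>` are hypothetical; only the `border="0"` attribute comes from the question, so the cell handling will likely need adjusting to the real page.

```php
<?php
// Hypothetical sample markup; the real page layout is an assumption.
$htmlCode = <<<HTML
<html><body>
<table border="0">
  <tr><td>Acme Ltd</td></tr>
  <tr><td>555-0100</td></tr>
  <tr><td>1 Main Street</td></tr>
</table>
<table border="1">
  <tr><td>Navigation, not a listing</td></tr>
</table>
<table border="0">
  <tr><td>Globex Inc</td></tr>
  <tr><td>555-0199</td></tr>
  <tr><td>2 Side Road</td></tr>
</table>
</body></html>
HTML;

// Real-world HTML is often invalid; collect parser warnings instead of printing them.
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($htmlCode);
libxml_clear_errors();

$companies = [];
foreach ($dom->getElementsByTagName('table') as $table) {
    // Keep only tables that start like the ones in the question.
    if ($table->getAttribute('border') !== '0') {
        continue;
    }
    // Collect the trimmed text of every cell in this table.
    $cells = [];
    foreach ($table->getElementsByTagName('td') as $td) {
        $cells[] = trim($td->textContent);
    }
    $companies[] = $cells;
}

// One tab-separated line per company, ready for a text file or a database import.
$output = '';
foreach ($companies as $cells) {
    $output .= implode("\t", $cells) . "\n";
}
echo $output;
```

From here, `file_put_contents()` could write `$output` to a text file, or each entry of `$companies` could be bound to a prepared INSERT statement.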
Brad Christie
1

If you're familiar with jQuery (and even if you're not, as its commands are simple enough), I recommend this PHP counterpart: http://code.google.com/p/phpquery/

Jacek Kaniuk
0

If your HTML is valid XML, as in XHTML, then you could parse it using SimpleXML.
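
As a rough illustration of the SimpleXML route, the sketch below parses a small, well-formed fragment and pulls the cell text out of each `border="0"` table with XPath. The markup and the company data are hypothetical, and the fragment deliberately has no `xmlns` attribute.

```php
<?php
// Hypothetical, well-formed XHTML-like fragment (no default namespace).
$xhtml = <<<XML
<html>
  <body>
    <table border="0">
      <tr><td>Acme Ltd</td><td>555-0100</td></tr>
    </table>
    <table border="0">
      <tr><td>Globex Inc</td><td>555-0199</td></tr>
    </table>
  </body>
</html>
XML;

// simplexml_load_string() returns false when the markup is not well-formed XML.
$xml = simplexml_load_string($xhtml);
if ($xml === false) {
    exit("Markup is not well-formed XML\n");
}

$rows = [];
foreach ($xml->xpath('//table[@border="0"]') as $table) {
    $row = [];
    // Relative XPath: every cell below the current table.
    foreach ($table->xpath('.//td') as $td) {
        $row[] = trim((string) $td);
    }
    $rows[] = $row;
}

print_r($rows);
```

Real XHTML documents usually declare a default namespace on the `<html>` element. In that case the unprefixed XPath above will not match, and the namespace has to be registered first with `registerXPathNamespace()` and used as a prefix in the query.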

Panman