I'm trying to pull data out of an HTML file into an array using PHP regular expressions. Below are two rows of the data file. I want to extract the part number (9517170 is one example), make, model, and the download URL. Here is my failed regex attempt to extract the part number and URL:

/Row[0|1] ([0-9]+)"(.*?)(\/component[0-9a-zA-Z_:-\/]+)/

Any regex gurus out there who can point me in the right direction?

Thanks!

    <tr id="table_6_row_127" class="fabrik_row oddRow1 9517170">
            <td class="fabrik_row___jos_baseplates___DemcoPart" ><a class='fabrik___rowlink' href='/baseplates/fitlist/details/6/6/127.html'>9517170</a></td>
            <td class="fabrik_row___jos_baseplates___Make" >Subaru</td>
            <td class="fabrik_row___jos_baseplates___Model" >Legacy Outback *4</td>
            <td class="fabrik_row___jos_baseplates___Years" >03-04</td>
            <td class="fabrik_row___jos_baseplates___A" >3</td>
            <td class="fabrik_row___jos_baseplates___B" >25</td>
            <td class="fabrik_row___jos_baseplates___C" >23</td>
            <td class="fabrik_row___jos_baseplates___D" >15 1/2</td>
            <td class="fabrik_row___jos_baseplates___Price" >370</td>
            <td class="fabrik_row___jos_baseplates___Download" ><a href='/component/docman/doc_download/250-tp20170.html' target='_self'>TP20170</a></td>
    </tr>
<tr id="table_6_row_431" class="fabrik_row oddRow0 9518272">
            <td class="fabrik_row___jos_baseplates___DemcoPart" ><a class='fabrik___rowlink' href='/baseplates/fitlist/details/6/6/431.html'>9518272</a></td>
            <td class="fabrik_row___jos_baseplates___Make" >Subaru</td>
            <td class="fabrik_row___jos_baseplates___Model" >Outback *4*9</td>
            <td class="fabrik_row___jos_baseplates___Years" >10-11</td>
            <td class="fabrik_row___jos_baseplates___A" >3</td>
            <td class="fabrik_row___jos_baseplates___B" >30</td>
            <td class="fabrik_row___jos_baseplates___C" >25-1/8"</td>
            <td class="fabrik_row___jos_baseplates___D" >17-1/4"</td>
            <td class="fabrik_row___jos_baseplates___Price" >370</td>
            <td class="fabrik_row___jos_baseplates___Download" ><a href='http://demco-products.com/component/docman/doc_download/921-tp20272.html' target='_self'>tp20272</a></td>
    </tr>
user77413
  • See http://stackoverflow.com/questions/590747/using-regular-expressions-to-parse-html-why-not – Rob H Mar 02 '11 at 23:52
  • possible duplicate of [RegEx match open tags except XHTML self-contained tags](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) – mmmmmm Aug 03 '11 at 07:53

1 Answer

Why not use DOMDocument::loadHTML()? It uses libxml under the hood, which is fast and robust.

**Don't try to parse HTML with regexes.**

I made that bold because I see it a lot on here, and the solutions are always fragile at best and buggy at worst. Once you have used a true HTML parser to get the attributes you want, running a regex on those values is more reasonable.
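
As a rough illustration of that last point, here is a minimal sketch that lets the parser find each row and only then applies a small regex to the row's `class` attribute to pick out the part number. The `$html` sample is trimmed from the question's data and wrapped in a `<table>` element (an assumption, so the fragment parses cleanly), and the variable names are illustrative only. Note that `[01]` is used here rather than the question's `[0|1]`, which would also match a literal `|`.

```php
<?php
// Sample input: one row from the question, wrapped in <table> (assumed wrapper).
$html = <<<'HTML'
<table>
<tr id="table_6_row_127" class="fabrik_row oddRow1 9517170">
  <td class="fabrik_row___jos_baseplates___Make" >Subaru</td>
</tr>
</table>
HTML;

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // collect parse warnings instead of printing them
$doc->loadHTML($html);
libxml_clear_errors();

$partNumbers = [];
foreach ($doc->getElementsByTagName('tr') as $tr) {
    // The parser supplies the attribute value; the regex only has to
    // pick the trailing number out of a short, well-defined string.
    if (preg_match('/\boddRow[01]\s+(\d+)\b/', $tr->getAttribute('class'), $m)) {
        $partNumbers[] = $m[1];
    }
}

print_r($partNumbers);   // $partNumbers is now ['9517170']
```

The regex here never sees tags or quoting, so changes in the markup around the row do not affect it.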

Andrew White
  • I've read the documentation on loadHTML(), but it is not at all clear how I can use that function to put the variables I want into a PHP array. There also do not seem to be any examples out there for extracting tabular data using that function. Anyone know of a good tutorial on this? – user77413 Mar 03 '11 at 00:22
  • I believe you can use XPath queries to get a list of tags of a certain type, which is only one step away from what you want. – Andrew White Mar 03 '11 at 00:49
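
To make the XPath suggestion concrete, the following is one possible sketch, not a definitive implementation. It assumes the rows sit inside a `<table>` and that the cell class names end in `DemcoPart`, `Make`, `Model`, and `Download` as in the question's sample. `firstValue()` and the array keys are hypothetical names introduced for this example.

```php
<?php
// Sample input: the first row from the question, with unused cells omitted.
$html = <<<'HTML'
<table>
<tr id="table_6_row_127" class="fabrik_row oddRow1 9517170">
  <td class="fabrik_row___jos_baseplates___DemcoPart" ><a class='fabrik___rowlink' href='/baseplates/fitlist/details/6/6/127.html'>9517170</a></td>
  <td class="fabrik_row___jos_baseplates___Make" >Subaru</td>
  <td class="fabrik_row___jos_baseplates___Model" >Legacy Outback *4</td>
  <td class="fabrik_row___jos_baseplates___Download" ><a href='/component/docman/doc_download/250-tp20170.html' target='_self'>TP20170</a></td>
</tr>
</table>
HTML;

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // real-world HTML often triggers parser warnings
$doc->loadHTML($html);
libxml_clear_errors();
$xpath = new DOMXPath($doc);

// Hypothetical helper: value of the first node matching $query relative to $row,
// or an empty string when nothing matches.
function firstValue(DOMXPath $xpath, string $query, DOMNode $row): string
{
    $nodes = $xpath->query($query, $row);
    if ($nodes === false || $nodes->length === 0) {
        return '';
    }
    return trim((string) $nodes->item(0)->nodeValue);
}

$rows = [];
foreach ($xpath->query('//tr[contains(@class, "fabrik_row")]') as $tr) {
    $rows[] = [
        'part'  => firstValue($xpath, './/td[contains(@class, "___DemcoPart")]', $tr),
        'make'  => firstValue($xpath, './/td[contains(@class, "___Make")]', $tr),
        'model' => firstValue($xpath, './/td[contains(@class, "___Model")]', $tr),
        // Selecting the attribute node directly yields the href value.
        'url'   => firstValue($xpath, './/td[contains(@class, "___Download")]/a/@href', $tr),
    ];
}

print_r($rows);
```

For a file on disk, `DOMDocument::loadHTMLFile()` could replace the `loadHTML()` call. Some of the download links in the sample data are relative and others absolute, so they may need normalizing afterwards.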