Extract content from each first TD in a Table

Question

I've got some HTML that looks like this:

<tr class="row-even">
    <td align="center">abcde</td>
    <td align="center"><a href="deluserconfirm.html?user=abcde"><img src="../images/delete_x.gif" alt="Delete User" border="none" /></a></td>
</tr>
<tr class="row-odd">
    <td align="center">efgh</td>
    <td align="center"><a href="deluserconfirm.html?user=efgh"><img src="../images/delete_x.gif" alt="Delete User" border="none" /></a></td>
</tr>
<tr class="row-even">
    <td align="center">ijkl</td>
    <td align="center"><a href="deluserconfirm.html?user=ijkl"><img src="../images/delete_x.gif" alt="Delete User" border="none" /></a></td>
</tr>

And I need to retrieve the values abcde, efgh, and ijkl.

This is the regex I'm currently using:

preg_match_all('/(<tr class="row-even">|<tr class="row-odd">)<td align="center">(.*)<\/td><\/tr>/xs', $html, $matches);

Yes, I'm not very good at them. As with most of my regex attempts, this is not working. Can anyone tell me why?

Also, I know about html/xml parsers, but it would require a significant code revisit to make that happen. So that's for later. We need to stick with regex for now.

EDIT: To clarify, I need the values between the first <td align="center"></td> tag after either <tr class="row-even"> or <tr class="row-odd">

I'm not so good with regular expressions, but could it be that you're missing a portion for the line break between ` — JohnoBoy, Oct 19 '10 at 07:13
please tell us what do you want to do exactly? what is the function of this? — klox, Oct 19 '10 at 07:14
@JohnoBoy: How do I enter the linebreaks? @klox: I need the values between the first <td align="center"></td> tag — HyderA, Oct 19 '10 at 07:17
You shouldn’t try regular expressions; use a proper HTML parser instead. — Gumbo, Oct 19 '10 at 07:21
To complete @Gumbo comment : http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 — Colin Hebert, Oct 19 '10 at 07:23
@Gumbo: As I've mentioned, I already know that. I'd rather fix this bug now than rewrite entire modules. That task is scheduled for the next release. For now, we need to get this up and running. — HyderA, Oct 19 '10 at 07:23
@gAMBOOKa: Don't sweat it — we're known to be really naggy here :P — BoltClock, Oct 19 '10 at 07:44

jensgram · Accepted Answer · 2010-10-19T07:39:58.767

~<tr class="row-(even|odd)">\s*<td align="center">(.*?)</td>~m

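Notice the use of \s*: it matches the line break and indentation between the <tr> and <td> tags. The m modifier is harmless but optional here, since it only changes how ^ and $ match.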

Also, you can make the first group non-capturing via ?:, i.e. (?:even|odd), as you're probably not interested in the class attribute :) The cell text is then in group 1 instead of group 2.
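As a minimal sketch of how this pattern could be called from PHP, assuming the markup from the question is in a string (the variable names $html, $pattern and $users are only for illustration). The m modifier is left out because it only affects ^ and $, and the non-capturing group means the values end up in $matches[1]:

```php
<?php
// Sample markup copied from the question (two of the three rows).
$html = <<<'HTML'
<tr class="row-even">
    <td align="center">abcde</td>
    <td align="center"><a href="deluserconfirm.html?user=abcde"><img src="../images/delete_x.gif" alt="Delete User" border="none" /></a></td>
</tr>
<tr class="row-odd">
    <td align="center">efgh</td>
    <td align="center"><a href="deluserconfirm.html?user=efgh"><img src="../images/delete_x.gif" alt="Delete User" border="none" /></a></td>
</tr>
HTML;

// (?:even|odd) is non-capturing, so the cell text is the only capture group.
$pattern = '~<tr class="row-(?:even|odd)">\s*<td align="center">(.*?)</td>~';

preg_match_all($pattern, $html, $matches);

// $matches[0] holds the full matches, $matches[1] the first-cell contents.
$users = $matches[1];
print_r($users);
```

Because the pattern is anchored on the opening <tr>, only the first cell of each row is captured; the second cell with the delete link is never reached.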

edited Oct 19 '10 at 07:39

answered Oct 19 '10 at 07:31

Finally! Someone not arguing over regex vs. HTML parsers! I tried it and it works perfectly. Just some clarification please, I tried the \s before and it didn't work with the *. Why is the * needed? Also, what do the ~ characters do? – HyderA Oct 19 '10 at 07:36
In PHP you can use any non-alphanumeric, non-backslash, non-whitespace character to mark the beginning and the end of your regex. He chose `~` for convenience. The `*` is a quantifier: it says you want zero or more repetitions of whatever precedes it, here `\s`, which matches whitespace characters. – Alin Purcaru Oct 19 '10 at 07:39
@gAMBOOKa What @Alin Purcaru said :) The `~` is chosen since it is not used elsewhere in my pattern. You often see `/` used as the delimiter, but that would force me to escape it as `\/` in the `</td>` part. Regarding `\s`: it matches a single space, tab or line break; `\s*` matches zero or more of them. – jensgram Oct 19 '10 at 07:43
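To illustrate the two points from these comments, here is a small sketch, assuming a single row from the question stored in a hypothetical $html string. It compares a bare \s with \s*, and the ~ delimiter with the / delimiter:

```php
<?php
// One row from the question: a newline and four spaces separate the tags.
$html = "<tr class=\"row-even\">\n    <td align=\"center\">abcde</td>";

// A bare \s matches exactly one whitespace character, but five of them
// (newline + indentation) sit between the tags, so this does not match.
$single = preg_match('~<tr class="row-even">\s<td align="center">(.*?)</td>~', $html);

// \s* matches any run of whitespace, including none at all.
$many = preg_match('~<tr class="row-even">\s*<td align="center">(.*?)</td>~', $html, $m);

// With "/" as the delimiter, the slash in </td> has to be escaped.
$slash = preg_match('/<tr class="row-even">\s*<td align="center">(.*?)<\/td>/', $html, $m2);

var_dump($single, $many, $slash);
```

The first call returns 0 and the other two return 1, with the cell text in $m[1] and $m2[1] respectively.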

score 2 · Answer 2 · answered Oct 19 '10 at 07:36

Try this:

preg_match_all('/(?:<tr class="row-even">|<tr class="row-odd">)\s*<td align="center">(.*?)<\/td>/s', $html, $matches);

Changes made:

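You had not accounted for the whitespace (newline and indentation) between the tags.
You don't need the x modifier; it makes the engine ignore literal whitespace in the pattern, so the spaces inside your tags would never match.
Make the matching non-greedy by using .*? in place of .*.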
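The effect of the last two changes can be seen in isolation with a short sketch. It assumes two rows of the question's markup in a hypothetical $html string, and it uses deliberately simplified patterns that match every cell rather than only the first one:

```php
<?php
$html = <<<'HTML'
<tr class="row-even">
    <td align="center">abcde</td>
    <td align="center"><a href="deluserconfirm.html?user=abcde"><img src="../images/delete_x.gif" alt="Delete User" border="none" /></a></td>
</tr>
<tr class="row-odd">
    <td align="center">efgh</td>
    <td align="center"><a href="deluserconfirm.html?user=efgh"><img src="../images/delete_x.gif" alt="Delete User" border="none" /></a></td>
</tr>
HTML;

// Greedy: with the s modifier, (.*) runs on to the LAST </td> in the input,
// so everything collapses into one big match.
preg_match_all('~<td align="center">(.*)</td>~s', $html, $greedy);

// Lazy: (.*?) stops at the first </td> after each opening tag.
preg_match_all('~<td align="center">(.*?)</td>~s', $html, $lazy);

// x modifier: literal spaces in the pattern are ignored, so the engine
// effectively looks for <trclass="row-even"> and finds nothing.
$withX    = preg_match('~<tr class="row-even">~x', $html);
$withoutX = preg_match('~<tr class="row-even">~', $html);

echo count($greedy[1]), ' greedy match, ', count($lazy[1]), " lazy matches\n";
var_dump($withX, $withoutX);
```

The greedy pattern yields a single match spanning both rows, the lazy one yields one match per cell, and the x-modified pattern never matches because the space in the tag is dropped from the pattern.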

score 2 · Answer 3 · answered Oct 19 '10 at 07:46

Actually, you don't need a big change to your codebase. Fetching text nodes works the same way every time with DOM and XPath; the only thing that changes is the XPath query. So you could wrap the DOM code in a function that replaces your preg_match_all calls. That would be just a tiny change, e.g.

include_once "dom.php";
$matches = dom_match_all('//tr/td[1]', $html);

where dom.php just contains:

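// dom.php
// Returns the text content of every node matched by the XPath $query.
function dom_match_all($query, $html, array $matches = array()) {
    $dom = new DOMDocument;
    libxml_use_internal_errors(TRUE);   // collect parse warnings instead of emitting them
    $dom->loadHTML($html);
    libxml_clear_errors();              // discard the collected warnings
    $xPath = new DOMXPath($dom);
    foreach( $xPath->query($query) as $node ) {
        $matches[] = $node->nodeValue;  // text content of the matched node
    }
    return $matches;
}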

and would return

Array
(
    [0] => abcde
    [1] => efgh
    [2] => ijkl
)

But if you want a Regex, use a Regex. I am just giving ideas.

answered Oct 19 '10 at 07:46

Gordon

I appreciate your effort, and it's a valid response, except that it's a lot more complicated in my case. I plan on using the simplehtmldom library, which I've found to be pretty slick. This application is, for all practical purposes, a crawler. So there are tonnes of regexes spread throughout the application. Simply including a new library is an effort because there's no central library inclusion class. I'll have multiple copies of code throughout the codebase if I reuse the current architecture. But I see your point, and I'm sure it will help someone looking for a similar solution. – HyderA Oct 19 '10 at 07:54
@gAMBOOKa no problem. You might also be interested in [Best Methods to parse HTML](http://stackoverflow.com/questions/3577641/best-methods-to-parse-html/3577662#3577662). IMO there are better libraries than SimpleHTMLDom. – Gordon Oct 19 '10 at 07:57

Swiss · Answer 4 · 2010-10-19T07:45:04.953

This is just a quick and dirty regex to meet your needs. It could easily be cleaned up and optimized, but it's a start.

<tr[^>]+>[^\n]*\n               #Match the opening <tr> tag
  \s*<td[^>]+>([^<]+)[^\n]+\n   #Group the wanted data
  [^\n]+\n                      #Match next line
</tr>                           #Match closing tag

Here is an alternative way, which may be more robust:

deluserconfirm\.html\?user=([^"]+)
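A rough sketch of how the href-based alternative might be used from PHP, assuming the same markup as in the question. The variable names are illustrative, the dot in ".html" is escaped so that it only matches a literal dot, and the urldecode step is an assumption for user names that contain URL-encoded characters:

```php
<?php
$html = <<<'HTML'
<tr class="row-even">
    <td align="center">abcde</td>
    <td align="center"><a href="deluserconfirm.html?user=abcde"><img src="../images/delete_x.gif" alt="Delete User" border="none" /></a></td>
</tr>
<tr class="row-odd">
    <td align="center">efgh</td>
    <td align="center"><a href="deluserconfirm.html?user=efgh"><img src="../images/delete_x.gif" alt="Delete User" border="none" /></a></td>
</tr>
HTML;

// Capture everything after "user=" up to the closing quote of the href.
preg_match_all('~deluserconfirm\.html\?user=([^"]+)~', $html, $matches);

// Assumption: the query value may be URL-encoded, so decode it.
$users = array_map('urldecode', $matches[1]);
print_r($users);
```

This variant does not depend on the table layout at all, only on the delete link being present in each row.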

score 0 · Answer 5 · answered Oct 19 '10 at 07:34

This is what I came up with

<td align="center">([^<]+)</td>

I'll explain. One of the challenges here is that what's between the td tags could be either the text you're looking for or an <a> tag. In the regex, [^<]+ says to match one or more characters that are not the < character. That's great, because it means the <a> tag won't match, and the group will only match up to the next tag.
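A small sketch of this behaviour, assuming two rows of the question's markup in a hypothetical $html string. The second cell of each row starts with a < character, so [^<]+ cannot match even one character there and that cell is skipped:

```php
<?php
$html = <<<'HTML'
<tr class="row-even">
    <td align="center">abcde</td>
    <td align="center"><a href="deluserconfirm.html?user=abcde"><img src="../images/delete_x.gif" alt="Delete User" border="none" /></a></td>
</tr>
<tr class="row-odd">
    <td align="center">efgh</td>
    <td align="center"><a href="deluserconfirm.html?user=efgh"><img src="../images/delete_x.gif" alt="Delete User" border="none" /></a></td>
</tr>
HTML;

// [^<]+ needs at least one character that is not "<", so only cells that
// contain plain text directly after the opening tag can match.
preg_match_all('~<td align="center">([^<]+)</td>~', $html, $matches);

$users = $matches[1];
print_r($users);
```

One consequence of the + quantifier is that an empty first cell would be skipped silently rather than returned as an empty string.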

answered Oct 19 '10 at 07:34

mellowsoon

Just noticed that in my answer my anchor tags were stripped out. – mellowsoon Oct 19 '10 at 07:41

score 0 · Answer 6 · answered Oct 19 '10 at 07:36

Disclaimer: Using regexps to parse HTML is dangerous.

To get the inner HTML of the first TD in each TR, use this regexp:

/<tr[^>]*>\s*<td[^>]*>(.+?)<\/td>/si

answered Oct 19 '10 at 07:36

W3Coder
