I'm trying to parse an HTML page with PHP and extract specific data. This is my regex:

$pattern = '/class=\"group\">.*[\n\r]*.*[\n\r]*.*[\n\r]*.*/';
preg_match_all($pattern, $subject, $matches);

And this is what I find (highlighted in yellow):

<NOBR>םושיר&nbsp;לטב<input type="checkbox" name="DEL104004"
onClick="UPG104004.selectedIndex=0"></NOBR></TD>
<TD class="group">22</TD>
<TD class="points">5.0</TD>
<TD>some text</TD>
<TD><A HREF="http://www.website.com/mk.php?MK=104004" class="mk">104004</A></TD>
</TR>
<TR ALIGN=RIGHT BGCOLOR=#FFCC33>
<TD COLSPAN=2><BR></TD>
<TD>5.0</TD>

But actually all I need is the data circled in red (22, 104004). Can I do it with a regex?

MORE INFO

I can assume that this particular structure won't change. The HTML is mostly a table with a few rows, some of which contain the data I want to get (group number and MK number).

Itay Gal

2 Answers

4

Per your updated info ( ...the data I want to get (group number and MK number) ), this can be done simply with XPath:

$dom=new DOMDocument("1.0","UTF-8");
$dom->loadHTML($html);
$xpath=new DOMXPath($dom);
/* In the sample HTML the "mk" class is on the <A> element, not on a <TD>, so match any element. */
foreach($xpath->query('//*[@class="group" or @class="mk"]') as $node)
{
    echo $node->attributes->getNamedItem("class")->nodeValue; /* class name */
    echo ": ";
    echo $node->textContent; /* data */
    echo "\n";
}

This avoids the line-break and line-count traps of the regex approach.
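
A self-contained sketch of this approach, run against a trimmed copy of the row from the question. The `$html` sample and the `$found` array are assumptions made for illustration. It also assumes the `mk` class sits on the `<A>` element, as in the question's sample, so the query matches any element (`*`) rather than only `td`.

```php
<?php
// Hypothetical sample input: a trimmed copy of the table row from the question.
$html = <<<HTML
<table>
<tr>
<td class="group">22</td>
<td class="points">5.0</td>
<td>some text</td>
<td><a href="http://www.website.com/mk.php?MK=104004" class="mk">104004</a></td>
</tr>
</table>
HTML;

// Real-world HTML is rarely valid; collect parser warnings instead of printing them.
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$found = [];
// Match any element whose class attribute is exactly "group" or "mk".
foreach ($xpath->query('//*[@class="group" or @class="mk"]') as $node) {
    $found[] = [$node->getAttribute('class'), trim($node->textContent)];
}

foreach ($found as [$class, $value]) {
    echo $class, ': ', $value, "\n";
}
```

For this sample the loop prints `group: 22` and then `mk: 104004`, in document order.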

Passerby
  • Although I asked about a regex, I might use this solution because it seems a better and easier alternative. – Itay Gal Feb 17 '14 at 10:54
3

If your HTML is constant and always follows this pattern, you can use a simple (but easy-to-break) regex:

$pattern = '/(?:class="group"[^>]*>|class="mk"[^>]*>)\s*(\d+)/';
preg_match_all($pattern, $subject, $matches);

This captures the digits that follow each of the wanted class attributes in the capturing group, i.e. in $matches[1]. Obviously, this is a quick and dirty solution: even small changes to the HTML would break it. Since you said this is for very limited use, it may be enough; if the markup is likely to change, you should really consider an HTML parser instead.

Some explanation

  • (\d+): \d is a shortcut for [0-9], and the parentheses form a capturing group. A capturing group stores what it matches, so that it can be reused in the same regex or extracted later. Here, the results of the first capturing group are stored in $matches[1].
  • (?:...): this structure is a non-capturing group. It lets you use parentheses to group patterns without capturing them, so that you only store what you want.
  • |: the pipe means "or".
  • [^...] means anything but what's inside the square brackets (the ^ is a special character inside these brackets).
  • \s is a shortcut for any kind of whitespace (newline, tab, space...).
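
To make the difference between the two kinds of group concrete, here is a small sketch run against a trimmed copy of the question's HTML. The `$subject` sample and the second pattern (`$labelled`, which also captures the class name) are assumptions added for illustration, not part of the answer's original code.

```php
<?php
// Hypothetical sample input, trimmed from the question's HTML.
$subject = '<TD class="group">22</TD>' . "\n"
         . '<TD class="points">5.0</TD>' . "\n"
         . '<TD><A HREF="http://www.website.com/mk.php?MK=104004" class="mk">104004</A></TD>';

// Non-capturing group around the alternation: only the digits are stored.
$pattern = '/(?:class="group"[^>]*>|class="mk"[^>]*>)\s*(\d+)/';
preg_match_all($pattern, $subject, $matches);
// $matches[0] holds the full matches, $matches[1] only the numbers.
print_r($matches[1]);

// Variant (an assumption for illustration): capture the class name as well,
// and request one array per match with PREG_SET_ORDER.
$labelled = '/class="(group|mk)"[^>]*>\s*(\d+)/';
preg_match_all($labelled, $subject, $sets, PREG_SET_ORDER);
foreach ($sets as $set) {
    echo $set[1], ': ', $set[2], "\n";
}
```

With this sample, `$matches[1]` contains `22` and `104004`; the `points` cell and the number inside the URL are skipped because neither follows one of the two wanted class attributes.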
Robin
  • The HTML contains line breaks and without `[\n\r]` my regex didn't work. – Itay Gal Feb 17 '14 at 09:46
  • as you haven't posted your regex (post your regex!) I can only assume you aren't using `preg_match_all` which will automatically parse new lines (`\n\r`) - you should. – scrowler Feb 17 '14 at 09:51
  • @ItayGal: Okay, just FYI, you can use the `s` flag to let the wildcard `.` match line breaks too: `/heres_my_regex/s` @scrowler: I believe he did post it, in the first three lines of the question :) – Robin Feb 17 '14 at 09:51
  • @Robin as I understand your regex only targets the class part and the number after it. It's better, but it still returns the `class="..` text and not only the number. I know I can extract the number easily but I thought there might be a way of getting the number in one regex. – Itay Gal Feb 17 '14 at 10:52
  • @ItayGal: Yep, the regex targets both. But the parentheses in `(\d+)` are a *capturing group*, which means the value they match is stored separately: look at what's in `$matches[1]`, which contains what was matched by the first capturing group. – Robin Feb 17 '14 at 11:03
  • @Robin, thank you for the answer; it was educational and a good one, so I upvoted it. Still, I decided to accept Passerby's answer because the solution he suggested, even though it doesn't use a regex, is easier to use for more complex things. – Itay Gal Feb 17 '14 at 21:38
  • @ItayGal: That's the way StackOverflow is intended to work, his answer is more suited indeed :) – Robin Feb 17 '14 at 21:41
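
The effect of the `s` flag mentioned in the comments can be sketched as follows. The `$subject` sample and the pattern that pairs a group number with the MK number of the same row are assumptions for illustration; the pattern relies on the attributes appearing exactly as in the question's HTML.

```php
<?php
// Hypothetical sample input: one table row spread over several lines.
$subject = "<TD class=\"group\">22</TD>\n"
         . "<TD class=\"points\">5.0</TD>\n"
         . "<TD>some text</TD>\n"
         . "<TD><A HREF=\"http://www.website.com/mk.php?MK=104004\" class=\"mk\">104004</A></TD>\n";

// Without /s the dot stops at the first newline, so the pattern cannot span the row.
$withoutS = preg_match('/class="group">(\d+).*?class="mk">(\d+)/', $subject, $m1);

// With /s the dot also matches newlines; the lazy .*? keeps the match as short as possible.
$withS = preg_match('/class="group">(\d+).*?class="mk">(\d+)/s', $subject, $m2);

var_dump($withoutS); // 0 matches without the flag
echo $m2[1], ' / ', $m2[2], "\n";
```

Here the first call finds nothing, while the second captures `22` and `104004` as a pair, which replaces the repeated `.*[\n\r]*` sequence from the question.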