Get part of the regex

Question

I'm trying to parse an HTML page and extract some specific data (with PHP). This is my regex:

$pattern = '/class=\"group\">.*[\n\r]*.*[\n\r]*.*[\n\r]*.*/';
preg_match_all($pattern, $subject, $matches);

And this is what it matches (highlighted in yellow):

<NOBR>םושיר&nbsp;לטב<input type="checkbox" name="DEL104004"
onClick="UPG104004.selectedIndex=0"></NOBR></TD>
<TD class="group">22</TD>
<TD class="points">5.0</TD>
<TD>some text</TD>
<TD><A HREF="http://www.website.com/mk.php?MK=104004" class="mk">104004</A></TD>
</TR>
<TR ALIGN=RIGHT BGCOLOR=#FFCC33>
<TD COLSPAN=2><BR></TD>
<TD>5.0</TD>

But actually all I need is the data circled in red (22, 104004). Can I do it with a regex?

MORE INFO

I can assume that this particular structure won't change. The HTML is mostly a table with a few rows, some of which contain the data I want to get (group number and MK number).

[You can't parse HTML with regex.](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) — naththedeveloper, Feb 17 '14 at 09:03
@FDL I'm not trying to parse all the HTML. I know I can't do it. I'm trying to get a specific data that can match a regex. — Itay Gal, Feb 17 '14 at 09:04
@FDL That's not quite so. Better to say: "One can't parse HTML with regex efficiently". — hindmost, Feb 17 '14 at 09:07
@hindmost there is only a `...` nothing special and I can't count on it, only on the data that I showed -> starts with a `class="group"` and ends after 3 more lines. — Itay Gal, Feb 17 '14 at 09:12
@ItayGal Do you want to: find ``s with class name "group", and the last `` in the same ``? — Passerby, Feb 17 '14 at 09:22
Not exactly. It's in the same `` and I already found that expression but I only need the data inside this specific expression. — Itay Gal, Feb 17 '14 at 09:25
@ItayGal Yes I know you need the data inside those ``s -- I was asking if those are the ``s you're targeting. — Passerby, Feb 17 '14 at 09:30
@ItayGal please post the HTML and define clearly the logic behind the matches. You need to make it easier for us to help you. You're just making it hard by using an image. — HamZa, Feb 17 '14 at 09:35
@ItayGal, could you post HTML, rather than image, for testing? — sinisake, Feb 17 '14 at 09:35
@ItayGal According to your updated info... http://3v4l.org/NYTHe ? — Passerby, Feb 17 '14 at 10:07

score 4 · Accepted Answer · answered Feb 17 '14 at 10:14

Per your updated info ( ...the data I want to get (group number and MK number) ), this can be done simply with XPath:

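$dom = new DOMDocument("1.0", "UTF-8");
libxml_use_internal_errors(true); /* collect parse warnings from imperfect HTML instead of printing them */
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
/* In the posted HTML, class="group" is on a <td> but class="mk" is on an <a>, so match any element. */
foreach ($xpath->query('//*[@class="group" or @class="mk"]') as $node)
{
    echo $node->attributes->getNamedItem("class")->nodeValue; /* class name */
    echo ": ";
    echo $node->textContent; /* data */
    echo "\n";
}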

Online demo

This avoids the line-break and line-count traps of the regex approach.
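
As a rough sketch of how the same idea could pair each group number with the MK number from the same row, the snippet below runs against a trimmed, lowercased copy of the rows from the question. The `$rows` array and the per-row XPath expressions are illustrative assumptions, not part of the original answer, and the sample is wrapped in a `<table>` only so that it is self-contained.

```php
<?php
// Trimmed sample based on the HTML in the question (wrapped in <table> for completeness).
$html = <<<'HTML'
<table>
<tr>
<td class="group">22</td>
<td class="points">5.0</td>
<td>some text</td>
<td><a href="http://www.website.com/mk.php?MK=104004" class="mk">104004</a></td>
</tr>
</table>
HTML;

// Collect libxml parse warnings instead of printing them.
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$rows = [];
// Visit every row that has a "group" cell, then read both values relative to that row.
foreach ($xpath->query('//tr[td[@class="group"]]') as $tr) {
    $rows[] = [
        'group' => trim($xpath->evaluate('string(td[@class="group"])', $tr)),
        'mk'    => trim($xpath->evaluate('string(.//a[@class="mk"])', $tr)),
    ];
}

print_r($rows);
```

Reading both values relative to the row keeps each group number attached to its MK number, which a flat list of matches would not guarantee if a row were missing one of the two.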

Although I asked about a regex, I might use this solution because it seems a better and easier alternative. — Itay Gal, Feb 17 '14 at 10:54

score 3 · Answer 2 · Robin · 2014-02-17T11:10:22.047

Well, if your HTML is constant and always follows this pattern, you can use a simple (but easy to break) regex:

$pattern = '/(?:class="group"[^>]*>|class="mk"[^>]*>)\s*(\d+)/';
preg_match_all($pattern, $subject, $matches);

This captures the digits that follow the wanted class attributes in the capturing group (i.e. in $matches[1]). Obviously, this is just a quick and dirty solution, since even small changes to the HTML would break it, but you said this was for very limited use. If the HTML is likely to change, you should really consider an HTML parser instead.

Some explanation

(\d+): \d is shorthand for [0-9], and the parentheses form a capturing group. A capturing group stores what it matches so that it can be reused later in the same regex or extracted afterwards. Here, the first capturing group's results are stored in $matches[1].
(?:...): a non-capturing group. It lets you use parentheses to group patterns without capturing them, so you only store what you want.
|: the pipe means "or".
[^...]: matches any character except those listed inside the square brackets (a ^ right after the opening bracket negates the class).
\s: shorthand for any whitespace character (newline, tab, space...).
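
To make the difference between the full match and the capturing group concrete, here is a small sketch that runs the pattern above against a few rows copied from the question. The nowdoc sample and the `print_r` calls are only there for illustration.

```php
<?php
// Sample rows taken from the HTML in the question.
$subject = <<<'HTML'
<TD class="group">22</TD>
<TD class="points">5.0</TD>
<TD>some text</TD>
<TD><A HREF="http://www.website.com/mk.php?MK=104004" class="mk">104004</A></TD>
HTML;

$pattern = '/(?:class="group"[^>]*>|class="mk"[^>]*>)\s*(\d+)/';
preg_match_all($pattern, $subject, $matches);

// $matches[0]: the full matches, including the class="..." text.
print_r($matches[0]);

// $matches[1]: only what the capturing group (\d+) matched.
print_r($matches[1]);
```

With this input, $matches[0] contains `class="group">22` and `class="mk">104004`, while $matches[1] contains just `22` and `104004`.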

edited Feb 17 '14 at 11:10

answered Feb 17 '14 at 09:34

Robin


The HTML contains line breaks and without `[\n\r]` my regex didn't work. – Itay Gal Feb 17 '14 at 09:46
as you haven't posted your regex (post your regex!) I can only assume you aren't using `preg_match_all` which will automatically parse new lines (`\n\r`) - you should. – scrowler Feb 17 '14 at 09:51
@ItayGal: Okay, well just FYI you can use the `s` flag to allow the wildcard `.` to match linebreak too: `/heres_my_regex/s` @scrowler: I believe he did post it, first three lines of the question :) – Robin Feb 17 '14 at 09:51
@Robin as I understand your regex only targets the class part and the number after it. It's better, but it still returns the `class="..` text and not only the number. I know I can extract the number easily but I thought there might be a way of getting the number in one regex. – Itay Gal Feb 17 '14 at 10:52
@ItayGal: Yep, the regex targets both. But the parentheses in `(\d+)` are a *capturing group*, which means the value they match will be stored in a different variable: here you should look at what's in `$matches[1]`, which contains what's matched by the first capturing group. – Robin Feb 17 '14 at 11:03
@Robin, thank you for the answer, it was educational and a good one, so I voted for you. Yet I decided to accept Passerby's answer because the solution he suggested, even if not using regex, was easier to use for more complex things. – Itay Gal Feb 17 '14 at 21:38
@ItayGal: That's the way StackOverflow is intended to work, his answer is more suited indeed :) – Robin Feb 17 '14 at 21:41
