I'm trying to get all occurrences of a regular expression using preg_match_all and then check whether a particular string appears in those occurrences. After that, I try to count and compare the number of occurrences, but it does not seem to work. I'm working with HTML data taken from the database, and yes, I really need regular expressions for HTML. No matter which data I take from the database, the result is the following: Image pregmatch count: 2Image search count: 1Table pregmatch count: 2Table search count: 1

This is my code snippet:

$query = $DB->get_field('book_chapters', 'content', array('bookid'=>'1'));

$img_pat = '/<img(.*)\>/i'; //regular expression for image tag search
$table_pat = '/<table(.*)\>/i'; //regular expression for table tag search

echo $query;

$content = serialize($query);

echo $content;

//image
preg_match_all($img_pat, $content, $img_pregmatch);
$img_search = array_search('alt="', $img_pregmatch);

echo 'Image pregmatch count: ' . count($img_pregmatch);
echo 'Image search count: ' . count($img_search);

//table
preg_match_all($table_pat, $content, $table_pregmatch);
$table_search = array_search('summary="', $table_pregmatch);

echo 'Table pregmatch count: ' . count($table_pregmatch);
echo 'Table search count: ' . count($table_search);

And this is an example using rubular.com:

rubular.com example

Any help or advice is appreciated, thanks!

Moirae
  • If you do nothing else today, add a lazy `?`, as in `(.*?)`. Otherwise you can capture a "super tag" that greedily eats up multiple img tags. – zx81 Apr 21 '14 at 21:20
  • You're clearly not skilled enough in regex to actually go parse HTML. I suggest you use [a parser, there are plenty to choose from](http://stackoverflow.com/q/3577641). If you don't believe me, let me point out: 1) You most likely need to use a lazy pattern `.*?` instead of the greedy `.*` 2) There's no need to escape `>` 3) You might use the `s` modifier 4) Use `regex101.com`, which actually supports PCRE 5) `preg_match_all()` produces a multidimensional array, so instead of using `count($img_pregmatch)` you need to use `count($img_pregmatch[0])` – HamZa Apr 21 '14 at 21:21
  • `array_search()` doesn't return an array, it returns the index of the first matching element. Why are you trying to count it? – Barmar Apr 21 '14 at 21:26
  • Moirae you DON'T need to count the matches returned by preg_match_all because the function RETURNS a count (see my answer) – zx81 Apr 21 '14 at 21:31

3 Answers

Try this:

preg_match_all($img_pat, $content, $img_pregmatch, PREG_SET_ORDER);

The default for the flags argument is PREG_PATTERN_ORDER, so $img_pregmatch[0] is an array of all matches of the whole regexp, and $img_pregmatch[N] is an array of all matches of capture group N. So count($img_pregmatch) is just the number of capture groups + 1, not the number of matches.

PREG_SET_ORDER inverts this, so each element of the match array corresponds to a match in the string.
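
To make the difference between the two flags concrete, here is a small sketch. The `$html` string is made-up sample input, not the asker's database content, and the pattern uses the lazy `(.*?)` form suggested in the comments.

```php
<?php
// Hypothetical sample input with three <img> tags.
$html = '<img src="a.png" alt="A"><p>text</p><img src="b.png"><img src="c.png" alt="C">';
$pattern = '/<img(.*?)>/i';

// Default flag (PREG_PATTERN_ORDER): one sub-array per capture group.
preg_match_all($pattern, $html, $byPattern);
echo count($byPattern), "\n";    // full match + 1 capture group = 2 sub-arrays
echo count($byPattern[0]), "\n"; // number of <img> tags matched

// PREG_SET_ORDER: one sub-array per match.
preg_match_all($pattern, $html, $bySet, PREG_SET_ORDER);
echo count($bySet), "\n";        // number of <img> tags matched
echo $bySet[1][0], "\n";         // full text of the second match
```

With the default flag, the outer count stays at 2 no matter how many tags the content has, which would explain why the question always prints 2.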

Barmar
  • There is no need to count the results of preg_match_all because the function returns a count. – zx81 Apr 21 '14 at 21:32
  • Yeah, I noticed that in your answer. But this may be useful depending on what else he's doing with the result. – Barmar Apr 21 '14 at 21:36
  • True, it can be very useful, you're right to draw his attention to it. Not sure if what he needs here is SETs, maybe the best would be if he learns about the flags first. – zx81 Apr 21 '14 at 21:44

preg_match_all() fills the matches array with one sub-array per capture group. So $img_pregmatch[0] will contain all of your full matches and $img_pregmatch[1] will contain everything captured by your first capture group.

Try changing your counts to:

echo 'Image pregmatch count: ' . count($img_pregmatch[0]);
echo 'Table pregmatch count: ' . count($table_pregmatch[0]);
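
For the second half of the question (checking which matches contain a particular string), a rough sketch could look like the following. As Barmar's comment notes, `array_search()` returns a single key rather than a list, so this filters the full matches for a substring instead. The `$content` value and the variable names are assumptions for illustration only.

```php
<?php
// Hypothetical sample input standing in for the database content.
$content = '<img src="a.png" alt="A"><img src="b.png"><table summary="Totals"></table>';

// Lazy quantifier so each match stops at the first closing '>'.
$imgCount = preg_match_all('/<img(.*?)>/i', $content, $imgMatches);

// Keep only the full matches that contain the substring alt=" (case-insensitive).
$withAlt = array_filter($imgMatches[0], function ($tag) {
    return stripos($tag, 'alt="') !== false;
});

echo 'Image pregmatch count: ' . $imgCount . "\n";
echo 'Image with alt count: ' . count($withAlt) . "\n";
```

The same approach should work for the table pattern with `summary="` as the substring.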

Note:

You shouldn't be using regular expressions to parse HTML, because HTML is not a regular language.
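
If a parser ever becomes an option, a minimal sketch with PHP's bundled DOM extension might look like this. It assumes the `dom` extension is enabled, and the `$html` string is invented sample markup rather than the real chapter content.

```php
<?php
// Hypothetical sample markup.
$html = '<p><img src="a.png" alt="A"><img src="b.png"></p>'
      . '<table summary="Totals"><tr><td>1</td></tr></table>';

$doc = new DOMDocument();
// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

// Count all <img> elements and those that carry an alt attribute.
$images = $doc->getElementsByTagName('img');
$withAlt = 0;
foreach ($images as $img) {
    if ($img->hasAttribute('alt')) {
        $withAlt++;
    }
}

echo 'Images: ' . $images->length . ', with alt: ' . $withAlt . "\n";
```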

Community
Sam
  • Thank you all for your advice and answers. I marked this one as the answer because it suits me best. And yes, I know that HTML shouldn't be parsed using regular expressions, but this is the solution I'm expected to implement. – Moirae Apr 22 '14 at 08:05

First off, there is never any need to count the overall matches of a preg_match_all, because preg_match_all returns the number of full pattern matches. Therefore you can write:

$count = preg_match_all($regex,$subject,$matches);

Without any more effort, this is the count you are looking for!
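
As a quick illustration of the return value, here is a sketch with made-up `$subject` data. Note that `preg_match_all` returns `false` on failure, so a strict check is worth adding.

```php
<?php
// Hypothetical sample input with two opening <table> tags.
$subject = '<table summary="s"><tr><td>1</td></tr></table><table></table>';
$regex = '/<table(.*?)>/i';

$count = preg_match_all($regex, $subject, $matches);

if ($count === false) {
    // false signals a regex error, which is different from zero matches.
    echo "Regex error\n";
} else {
    echo 'Table pregmatch count: ' . $count . "\n";
}
```

Here `$count` is the same number you would get from `count($matches[0])`.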

Next, you must add a lazy ?, as in (.*?). Otherwise you can capture a "super tag" that greedily eats up multiple img tags.
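
To see the "super tag" effect, compare the greedy and lazy patterns on a made-up one-line string (the `$html` value is only an illustration):

```php
<?php
// Hypothetical sample input: two <img> tags on the same line.
$html = '<img src="a.png"> some text <img src="b.png">';

// Greedy: (.*) runs to the last '>' on the line, so both tags become one match.
$greedyCount = preg_match_all('/<img(.*)>/i', $html, $greedy);

// Lazy: (.*?) stops at the first '>', so each tag is matched separately.
$lazyCount = preg_match_all('/<img(.*?)>/i', $html, $lazy);

echo 'Greedy: ' . $greedyCount . "\n";
echo 'Lazy: ' . $lazyCount . "\n";
```

In the greedy case the single match is the entire string, from the first `<img` to the final `>`.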

If you happen to want to know how many Group 1 matches were captured, you could count($matches[1]), but that is not what we are doing here.

preg_match_all is a wonderful function. I recommend you study these usages of preg_match_all to understand the formation of the arrays returned.

zx81