preg_match_all returns only the first 159 of 261 possible results

Question

I hope someone knows what is wrong. I wrote a parser to extract all the

<a href="blabla">Link</a>

tags. I am testing it on http://www.bbc.co.uk/. There are 261 of them on that page, but I get only the first 159. I checked manually and can find every single one of them in the source, yet my resulting array has only 159 elements. What is causing this limit?

preg_match_all('/<a\s[^\>]*href\=[\'"]?((?:http\:\/\/)?(?:[_\-a-zA-Z0-9\.]*[_a-zA-Z0-9\.\/]))*[\'"]/', $page, $matches);

I checked, and curl returns the whole page, from

<html>

to

</html>

The task is to write the parser without any DOM usage, using only curl and regular expressions.

Did you read this: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags ? — sectus, Oct 31 '13 at 02:19
can't tell you what's wrong if we don't know what you are matching against. But I'm going to go out on a limb and guess you want to get all the links on a page.. have you considered using http://www.php.net/DOM instead? — CrayonViolent, Oct 31 '13 at 02:21
I added some details. I am looking for anchor tags. I find them successfully, though not all 261, only the first 159. It looks like there is a limit somewhere. — Bandydan, Oct 31 '13 at 02:22
sectus, I did. I must find all the links on the site, and I can't use DOM manipulation or any libraries. I succeeded, but there's a limit! — Bandydan, Oct 31 '13 at 02:28
Are you sure it's the first 159 that you find? If not, find the first link that is not matched and check why it does not match. — jeroen, Oct 31 '13 at 02:29
also, *you still aren't helping us to help you*. Bottom line is we **can't** tell you what's wrong when we don't know the content you are trying to get. IOW show us the content! — CrayonViolent, Oct 31 '13 at 02:32
I added the content, but the content didn't matter, as I said. Thanks for trying anyway, guys, I appreciate that. — Bandydan, Oct 31 '13 at 02:38

score 0 · Accepted Answer · answered Oct 31 '13 at 13:52

OK, I managed to solve this issue by adding some more characters to my regex:

preg_match_all('/<a\s*[^\>]*href\s*\=\s*[\'"]?((?:http\:\/\/)?(?:[_\-a-zA-Z0-9\.]*[\?\=\&_a-zA-Z0-9\.\/]))*[\'"]/', $page, $matches);

I allowed optional whitespace after `<a` and around the `=`, and added the characters '?', '=' and '&' to the character class so that they are permitted in the body of the link.
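For anyone hitting something similar: preg_match_all itself has no cap on the number of matches, but it can stop early if PCRE reports an error (for example when pcre.backtrack_limit is reached), and nested quantifiers like the ones in the pattern above can be prone to that on long URLs. I can't confirm that this is what happened here, so treat the following only as a minimal sketch of a simpler pattern plus a preg_last_error() check. The function name extractHrefs and the sample markup are made up for illustration, and the pattern only handles quoted href values.

```php
<?php
// Hypothetical helper (not from the original post): collects quoted href
// values from <a> tags using only a regular expression, no DOM.
function extractHrefs(string $html): array
{
    // "<a" followed by whitespace, any attributes, then href = "value" or 'value'.
    // The value is captured lazily up to the matching closing quote, so
    // characters such as ?, = and & need no special handling.
    // Assumption: unquoted href values are not handled by this sketch.
    $pattern = '/<a(?=\s)[^>]*?\shref\s*=\s*(["\'])(.*?)\1/i';

    $count = preg_match_all($pattern, $html, $matches);

    // A PCRE error (for example a backtrack limit) can end matching early,
    // so check for it instead of trusting a short result.
    if ($count === false || preg_last_error() !== PREG_NO_ERROR) {
        throw new RuntimeException('PCRE error code: ' . preg_last_error());
    }

    return $matches[2]; // group 2 holds the href values
}

// Made-up sample input for illustration.
$page = <<<'HTML'
<a href="http://www.bbc.co.uk/news/">News</a>
<a class="nav" href = '/search?q=php&page=2'>Search</a>
<A HREF="/sport">Sport</A>
<abbr title="x">not a link</abbr>
HTML;

$links = extractHrefs($page);
print_r($links);
```

If the result is still shorter than expected, jeroen's suggestion from the comments applies: find the first anchor that is missing from the result and test the pattern against that one tag on its own.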

