I hope someone knows what is wrong. I made a parser to get all the

<a href="blabla">Link</a>

tags. I am testing it on http://www.bbc.co.uk/. There are 261 of them on the page I test, but I receive only the first 159. I checked manually and can find every single one of them in the source, yet my resulting array has only 159 elements. What is the cause of that limit?

preg_match_all('/<a\s[^\>]*href\=[\'"]?((?:http\:\/\/)?(?:[_\-a-zA-Z0-9\.]*[_a-zA-Z0-9\.\/]))*[\'"]/', $page, $matches);

I checked that curl gives me the whole page, from

<html>

to

</html>

The task is to build the parser without any DOM usage, just curl and regular expressions.

Bandydan
  • What tags? What are you trying to match? – John Conde Oct 31 '13 at 02:15
  • Did you read this: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags ? – sectus Oct 31 '13 at 02:19
  • Include more details like all tag info in your page. – Jenson M John Oct 31 '13 at 02:19
  • can't tell you what's wrong if we don't know what you are matching against. But I'm going to go out on a limb and guess you want to get all the links on a page.. have you considered using http://www.php.net/DOM instead? – CrayonViolent Oct 31 '13 at 02:21
  • I added some details. I am looking for anchor tags. I successfully find them, though not all 261, only the first 159. It looks like there is a limit somewhere. – Bandydan Oct 31 '13 at 02:22
  • sectus, I did. I must find all the links on the site, and I can't use DOM manipulation or any libraries. I succeeded, but there's a limit! – Bandydan Oct 31 '13 at 02:28
  • Are you sure it's the first 159 that you find? If not, find the first link that is not matched and check why it does not match. – jeroen Oct 31 '13 at 02:29
  • I did that already, so I am sure. – Bandydan Oct 31 '13 at 02:31
  • so uh, why can't you use php's DOM class? – CrayonViolent Oct 31 '13 at 02:32
  • also, *you still aren't helping us to help you*. Bottom line is we **can't** tell you what's wrong when we don't know the content you are trying to get. IOW show us the content! – CrayonViolent Oct 31 '13 at 02:32
  • I added the content, but the content didn't matter, as I said. Thanks for trying anyway, guys, I appreciate that. – Bandydan Oct 31 '13 at 02:38

1 Answer

OK, I managed to solve this issue by adding some more characters to my regex:

preg_match_all('/<a\s*[^\>]*href\s*\=\s*[\'"]?((?:http\:\/\/)?(?:[_\-a-zA-Z0-9\.]*[\?\=\&_a-zA-Z0-9\.\/]))*[\'"]/', $page, $matches);

I allowed optional whitespace around the '=' of the href attribute, and added the characters '=', '&' and '?' to those permitted in the body of the link.
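To see the difference between the two patterns, here is a minimal sketch that runs both on a few made-up anchor tags. The sample markup, the example.com URLs and the `findAnchors` helper are assumptions for illustration only; they are not taken from the BBC page. The helper also checks for a `false` return from `preg_match_all`, since that signals a PCRE error rather than "no matches".

```php
<?php
// Hypothetical sample markup (not from the real page): a plain link,
// a link with a query string, and a link with spaces around '='.
$page = '<a href="http://example.com/news">News</a>'
      . '<a href="http://example.com/q?id=7&p=2">Search</a>'
      . '<a href = "http://example.com/about">About</a>';

// Pattern from the question.
$old = '/<a\s[^\>]*href\=[\'"]?((?:http\:\/\/)?(?:[_\-a-zA-Z0-9\.]*[_a-zA-Z0-9\.\/]))*[\'"]/';

// Pattern from the answer.
$new = '/<a\s*[^\>]*href\s*\=\s*[\'"]?((?:http\:\/\/)?(?:[_\-a-zA-Z0-9\.]*[\?\=\&_a-zA-Z0-9\.\/]))*[\'"]/';

// Hypothetical helper: returns the full-pattern matches ($matches[0]).
// Throws if preg_match_all() returns false, which indicates a PCRE error.
function findAnchors(string $pattern, string $page): array
{
    $count = preg_match_all($pattern, $page, $matches);
    if ($count === false) {
        throw new RuntimeException('PCRE error code: ' . preg_last_error());
    }
    return $matches[0];
}

$oldMatches = findAnchors($old, $page);
$newMatches = findAnchors($new, $page);

echo count($oldMatches), " match(es) with the question's pattern:\n";
print_r($oldMatches);
echo count($newMatches), " match(es) with the answer's pattern:\n";
print_r($newMatches);
```

On this sample, the question's pattern skips the tag with spaces around '=' and cuts the query-string link down to `<a href="`, while the answer's pattern returns all three tags in full. Note that the capture group is repeated, so `$matches[1]` holds only the last repetition; the full tag text is in `$matches[0]`. Characters still outside the class, such as '%' or ';', would presumably cut a match short in the same way.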

Bandydan