Onetrickpony got to the heart of what's wrong with your regex: your numeric ID has multiple digits, but your regex only matches a single digit.
There are some other things I'm going to throw out there for your consideration. First, if there are other attributes in your <a> tag, your regex will fail. For example, if there's a target="_blank" attribute, it will mess up your regex. Fortunately, there's an easy way around that:
preg_match_all('/<a .*?href="article\.html\?id=([0-9]+)".*?>(.*?)<\/a>/',
$webpage, $match);
Essentially, I just padded the href attribute with .*? on both sides. The question mark makes the match lazy (instead of greedy, which is the default), which prevents it from consuming more than you want. I also replaced your [^<] with a lazy match, because I generally find it a little cleaner.
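To illustrate the padded pattern, here's a minimal sketch run against made-up markup. The contents of $webpage below are my assumption for the sake of the example, not your actual page:

```php
<?php
// Hypothetical sample markup; in practice $webpage holds the fetched page.
$webpage = '<a href="article.html?id=123">First</a>'
         . '<a class="x" href="article.html?id=4567" target="_blank">Second</a>';

// Lazy .*? on either side of href tolerates extra attributes before or after it.
preg_match_all(
    '/<a .*?href="article\.html\?id=([0-9]+)".*?>(.*?)<\/a>/',
    $webpage,
    $match
);

print_r($match[1]); // captured IDs: 123 and 4567
print_r($match[2]); // captured link text: First and Second
```

Both links are picked up, including the one with class and target attributes around the href.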
UPDATE: As demonking correctly pointed out, the period and question mark in article.html?id= need to be escaped. The period doesn't matter so much: left unescaped, it matches any character, so the pattern would also accept article_html and the like, which is probably not a concern. However, not escaping the question mark is trouble. It makes the l in html optional, but then there's nothing to actually match the literal question mark, which is probably why my uncorrected solution was failing. Thanks, demonking!
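A quick way to see the difference is to run both versions against a single link. This is just a sketch, and the $link string is a made-up example:

```php
<?php
// Hypothetical sample link for illustration.
$link = '<a href="article.html?id=42">Title</a>';

// Unescaped: "l?" makes the l optional, and nothing matches the literal "?".
$unescaped = preg_match('/href="article.html?id=([0-9]+)"/', $link, $m1);

// Escaped: "\." and "\?" match the literal period and question mark.
$escaped = preg_match('/href="article\.html\?id=([0-9]+)"/', $link, $m2);

var_dump($unescaped); // int(0), no match
var_dump($escaped);   // int(1), and $m2[1] holds "42"
```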