
I am working on moving some blog-ish articles to a new third-party home, and need to replace some existing URLs with new ones. I cannot use XML, and am being forced to use a wrapper class that requires this search to happen in regex. I'm currently having trouble regex-ing for the URLs that exist in the html. For example if the html is:

<h1><a href="http://www.website.com/article/slug-that-has-undetermined-amount-of-hyphens/12345">Whatever</a></h1>

I need my regex to return:

http://www.website.com/article/slug-that-has-undetermined-amount-of-hyphens/12345

The beginning of the URL never changes (the "http://www.website.com/article/" part). However, I have no clue what the slug phrases are going to be; I only know they will contain an unknown number of hyphens between the words. The ID number at the end of the URL could be any integer.

There are multiple links of these types in each article, and there are also other types of URLs in the article that I want to be sure are ignored, so I can't just look for phrases starting with http inside of quotes.

FWIW: I'm working in PHP and am currently trying to use preg_match_all to return an array of the URLs needed.

Here's my latest attempt:

$array_of_urls = [];
preg_match_all('/http:\/\/www\.website\.com\/article\/[^"]*/', $variable_with_html, $array_of_urls);
var_dump($array_of_urls);

And then I get nada dumped out. Any help appreciated!!!

1 Answer


We, StackOverflow volunteers, must insist on enjoying the stability of a dom parser rather than regex when parsing html data.

Code:

$html=<<<HTML
<h1><a href="http://www.website.com/article/slug-that-has-undetermined-amount-of-hyphens/12345">Whatever</a></h1>
<p>Here is a url as plain text: http://www.website.com/article/sluggy-slug</p>
<div>Here is a qualifying link: <a href="http://www.website.com/article/slugger-sluggington-jr/666">Whatever</a></div>
HTML;

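$dom = new DOMDocument();
$dom->loadHTML($html);
$output = [];  // initialise so the variable exists even when no links are found
foreach ($dom->getElementsByTagName('a') as $item) {
    // collect the href of every <a> element; plain-text URLs are ignored
    $output[] = $item->getAttribute('href');
}
var_export($output);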

Output:

array (
  0 => 'http://www.website.com/article/slug-that-has-undetermined-amount-of-hyphens/12345',
  1 => 'http://www.website.com/article/slugger-sluggington-jr/666',
)
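
The loop above collects every href in the document. Since the question asks for other kinds of links to be ignored, here is a minimal sketch of one way to filter the DOM results. The sample HTML (including the example.org link) and the assumption that slugs consist only of letters, digits and hyphens are mine, not the asker's, so adjust the character class if real slugs differ.

```php
<?php
// Hypothetical sample input: two qualifying links and one unrelated link.
$html = <<<HTML
<h1><a href="http://www.website.com/article/slug-that-has-undetermined-amount-of-hyphens/12345">Whatever</a></h1>
<p><a href="http://www.example.org/about">Unrelated link</a></p>
<div><a href="http://www.website.com/article/slugger-sluggington-jr/666">Whatever</a></div>
HTML;

$dom = new DOMDocument();
$dom->loadHTML($html);

$articleUrls = [];
foreach ($dom->getElementsByTagName('a') as $anchor) {
    $href = $anchor->getAttribute('href');
    // Keep only hrefs made of the fixed prefix, a slug, and an integer ID.
    // Assumption: slugs contain only letters, digits and hyphens.
    if (preg_match('~^http://www\.website\.com/article/[a-z0-9-]+/\d+$~i', $href)) {
        $articleUrls[] = $href;
    }
}
var_export($articleUrls);
```

With this input, only the two website.com article links should end up in $articleUrls. The regex here only validates a value the parser has already extracted, so it never has to cope with the surrounding markup.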

If, for some crazy reason, the above doesn't work for your project and you MUST use regex, this should suffice:

~<a.*?href="\K[^"]+~i  // using case-insensitive flag in case of all-caps syntax
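
For completeness, here is a rough sketch of how a pattern like this might be used with preg_match_all, narrowed to the article URLs described in the question. The tightened pattern is my own variation rather than a tested drop-in: it assumes the href value is wrapped in double quotes and that the slug contains no slashes.

```php
<?php
$html = <<<HTML
<h1><a href="http://www.website.com/article/slug-that-has-undetermined-amount-of-hyphens/12345">Whatever</a></h1>
<p>Here is a url as plain text: http://www.website.com/article/sluggy-slug</p>
<div>Here is a qualifying link: <a href="http://www.website.com/article/slugger-sluggington-jr/666">Whatever</a></div>
HTML;

// \b after "<a" avoids matching tags such as <abbr>.
// [^>]*? keeps the search for href inside a single tag.
// \K discards what was matched so far, so the full match is just the URL.
// (?=") requires the closing quote right after the integer ID.
$pattern = '~<a\b[^>]*?href="\Khttp://www\.website\.com/article/[^"/]+/\d+(?=")~i';

preg_match_all($pattern, $html, $matches);
$urls = $matches[0];  // element 0 holds the full-pattern matches
var_export($urls);
```

Note that preg_match_all fills a two-dimensional array, so the URLs sit in $matches[0] rather than at the top level. The plain-text URL is skipped because it is not inside an href attribute.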


mickmackusa
  • If anything in my answer doesn't work as expected, please offer a realistic sample input to work with and describe what isn't quite right. I'll do my best to adjust my answer or otherwise offer guidance. – mickmackusa Mar 06 '18 at 01:42