
I'm stumped. I've got two regexes that each do what I need on their own, but I'm not sure how to get them to work in conjunction.

\b([a-zA-Z])?\d{5}\b is correctly finding strings in the pattern of an optional single letter followed by 5 digits.

<a\s+(?:[^>]*?\s+)?href="([^"]*)" is matching the URL in an anchor tag.

Now what I want to match (for replacement purposes) is the 5-digit number (with or without the preceding letter) that occurs within the URL of an anchor tag.

Sample content:

<a href="/uploads/2014/04/Draft-99990-Details.doc">Draft 99995 Details</a> <a href="/uploads/2014/04/01090-vs-G01010-series.pdf">01095 vs G01015 Series</a>

There should be 3 matches in this text: the 3 numbers ending in 0, and not those ending in 5.

  • What is the "link portion" of an anchor tag? Is that the actual URL or the text inside the tag (I'm assuming the URL)? – Sam Apr 28 '14 at 18:32
  • The URL, what's between the quotes after "href=". Sorry I worded that strangely. Fixed now. – Sandbox Wizard Apr 28 '14 at 18:44

2 Answers


Split the task into two steps. First, retrieve the contents of every href attribute using a DOM parser such as PHP's DOMDocument, then use a regular expression to replace the specific part. The advantage of this method over a single regular expression is that it won't break even if the format of your markup changes in the future.

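// Sample markup from the question.
$html = <<<HTML
<a href="/uploads/2014/04/Draft-99990-Details.doc">Draft 99995 Details</a>
<a href="/uploads/2014/04/01090-vs-G01010-series.pdf">01095 vs G01015 Series</a>
HTML;

$dom = new DOMDocument;
$dom->loadHTML($html);

$replacement = 'FOO';
$html = '';

foreach ($dom->getElementsByTagName('a') as $node) {
    // Rewrite only the href value; the link text is left untouched.
    // \b keeps the pattern from matching inside a longer run of digits.
    $href = $node->getAttribute('href');
    $node->setAttribute('href', preg_replace('/\b[a-z]?\d{5}\b/i', $replacement, $href));
    // Collect the serialized <a> element; markup outside the links is not kept.
    $html .= $dom->saveHTML($node);
}

echo $html;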

Output:

<a href="/uploads/2014/04/Draft-FOO-Details.doc">Draft 99995 Details</a>
<a href="/uploads/2014/04/FOO-vs-FOO-series.pdf">01095 vs G01015 Series</a>


Amal Murali
  • Finally getting some time to come back to this project. I've got your solution doing what I need it to, but how do I handle any HTML that's not a link? ie. if I have 'some content here Draft 99995 Details more text here 01095 vs G01015 Series and some final text here'? – Sandbox Wizard May 20 '14 at 19:19
  • @SandboxWizard: I'm not sure what you mean by "handle". Are you trying to have it work with invalid HTML too? If so, what's the expected output in the above case? – Amal Murali May 21 '14 at 02:23
  • To keep my example in the comment and the replacement from your example. The expected output is: 'some content here Draft 99995 Details more text here 01095 vs G01015 Series and some final text here'. So if I pass in the HTML content for a page I want to get that same content back, but with the replacements made in the anchor tags. Did that make sense? – Sandbox Wizard May 21 '14 at 04:53
  • `Draft 99995 Details` is not valid HTML in the first place. How do you expect it to work? Is that a typo? Or just an intentional mistake? I'm not sure what you're asking. To me, it looks like a different problem. Maybe you want to post a new question explaining all the details and reference this one if required, @SandboxWizard. – Amal Murali May 21 '14 at 04:55
  • Sorry, that's a typo. The content in my first comment should be: 'some content here Draft 99995 Details more text here 01095 vs G01015 Series and some final text here' The HTML being passed I'm creating, so it will be valid. – Sandbox Wizard May 21 '14 at 05:09
  • @SandboxWizard I'm confused. The expected output doesn't make much sense. Can we take this to chat? – Amal Murali May 21 '14 at 05:11
  • As I understand it the 'foreach' in your example code is pulling out all the anchor tags and doing the replace, then saving the HTML for that anchor tag to $html. Any other HTML in the original $html will be lost, correct? – Sandbox Wizard May 21 '14 at 05:11
  • @SandboxWizard: Yes, that is correct. This code currently only modifies the `href` attribute. If you want the entire document to be output as it is, except with the `href`s changed, you could simply use this: https://eval.in/154617 – Amal Murali May 21 '14 at 05:18
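
The last comment describes returning the entire document with only the `href`s changed. A minimal sketch of that idea follows; it is not the code behind the eval.in link, and the sample markup and the `$result` variable are made up for illustration. It assumes the input is valid HTML and that it is acceptable for `DOMDocument` to wrap the fragment in a doctype and `<html><body>` elements when serializing.

```php
<?php
// Hypothetical input: a link mixed with other markup.
$html = <<<HTML
<p>some content here</p>
<a href="/uploads/2014/04/Draft-99990-Details.doc">Draft 99995 Details</a>
<p>more text here 01095</p>
HTML;

$dom = new DOMDocument;
$dom->loadHTML($html);

foreach ($dom->getElementsByTagName('a') as $node) {
    // Only the href attribute is rewritten; text nodes are never touched.
    $href = $node->getAttribute('href');
    $node->setAttribute('href', preg_replace('/\b[a-z]?\d{5}\b/i', 'FOO', $href));
}

// Serialize the whole document rather than each <a> node,
// so the markup around the links is kept.
$result = $dom->saveHTML();
echo $result;
```

The numbers in the link text and in the surrounding paragraphs are left alone because the replacement only ever sees the attribute value.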

This expression should do the trick.

(?:<a\s+(?:[^>]*)?href="|(?!^)\G)\K[^"]*?([A-Z]?\d{5})(?=[^"]*")

Explanation:

(?:                         # BEGIN non-capturing group
    <a\s+(?:[^>]*)?href="   # Anchor tag up until the href attribute
  |                         # OR
    (?!^)\G                 # \G resumes where the last match ended (but not at the start of the string)
)                           # END non-capturing group
\K                          # Reset the match start (drop the anchor tag from the match)
[^"]*?                      # Lazily match URL characters, never past the closing quote
([A-Z]?\d{5})               # Capture an optional letter followed by 5 digits
(?=                         # BEGIN look ahead
    [^"]*"                  # Require the closing quote of the URL to follow
)                           # END look ahead

Use it as a global, case-insensitive match (the g and i modifiers; in PHP, the i modifier with preg_match_all or preg_replace, which already match globally). Please note that each match ends at the end of the capture group rather than at the end of the URL. This is because \G resumes at the end of the last match: if we matched the entire URL, \G would resume after the URL and we would miss any further numbers in it. Because the match starts right after \K, it also includes the part of the URL that precedes the number; the number itself is in capture group 1.
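
To make the replacement use case concrete, here is a hedged PHP sketch. It is a variation on the expression above rather than the expression itself: `\K` is moved to just before the number so that the whole match is only the number, and the URL portion is matched with `[^"]*?` so that a match cannot run past the closing quote. The variable names and the `FOO` replacement are assumptions for illustration.

```php
<?php
// Sample content from the question, on a single line.
$html = '<a href="/uploads/2014/04/Draft-99990-Details.doc">Draft 99995 Details</a> '
      . '<a href="/uploads/2014/04/01090-vs-G01010-series.pdf">01095 vs G01015 Series</a>';

// Either start at an href attribute or resume (\G) where the previous
// match ended, skip URL characters, then reset the match start (\K)
// so that only the optional letter plus five digits is reported.
$pattern = '~(?:<a\s+(?:[^>]*)?href="|(?!^)\G)[^"]*?\K[A-Z]?\d{5}(?=[^"]*")~i';

// List the numbers found inside href values.
preg_match_all($pattern, $html, $matches);
print_r($matches[0]); // 99990, 01090, G01010

// Replace them; the numbers in the link text are not touched.
$result = preg_replace($pattern, 'FOO', $html);
echo $result, "\n";
```

Keeping the prefix out of the match this way means `preg_replace` can be used directly, without a callback that re-inserts the leading part of the URL.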

Hat tip to Casimir's answer.

Sam
  • In the unlikely event that the `href` attribute contains a `>` character, this would fail: ` – Amal Murali Apr 28 '14 at 19:03
  • In that case, you can just replace `[^>]*` with a lazy match `.*?`. But this does drive home the point that HTML [is not a regular language](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454). – Sam Apr 28 '14 at 19:05