I am working on a project where I need to parse and manipulate HTML. One requirement is to replace the base URL in an HTML string. I am trying to use a regular expression for this and have tried multiple patterns, but with no luck. Below is my current code -

<?php
$html = '<html><head><base href="/" /></head><body></body></html>';
$base = 'https://SOME_URL/';

$output = preg_replace('/<base href="(.+)">/', $base, $html);

print $output;

Current Output -

$html = '<html><head><base href="/" /></head><body></body></html>';

Expected Output -

$html = '<html><head><base href="https://SOME_URL/" /></head><body></body></html>';

Vivek Srivastava

2 Answers

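Your regex - <base href="(.+)"> - is not matching because the part after "(.+)" is wrong. Look at the source string - <base href="/" /> - and notice that after the closing quote there is a space and a / before the >, whereas your pattern expects the > immediately after the quote.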

This is just one of the many reasons parsing HTML with regex is a bad idea. That element is perfectly valid without that space, and in HTML the trailing / is optional as well.

However, if you're 100% positive that the markup around this base element won't get too complex (e.g. lots of nesting, new lines between the attributes, etc.), you may be able to get by with just - /<base[ ]*?href=".+"/i

Check out the demo

In PHP, to get your expected output, you'd do-

$base = 'https://SOME_URL/';

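// [^"]* stops at the closing quote of href, so any later attributes are left alone;
// '${1}' and '${2}' keep the backreferences unambiguous even if $base starts with a digit
$output = preg_replace('/(<base[ ]*?href=")[^"]*(")/i', '${1}' . $base . '${2}', $html);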
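
To make the difference concrete, here is a minimal sketch comparing the question's pattern with the capture-group approach. The `target` attribute in `$htmlWithTarget` is a hypothetical addition, included only to show how a greedy `.+` can swallow neighbouring attributes. Using `[^"]*` is one way to avoid that, not the only one.

```php
<?php
$html = '<html><head><base href="/" /></head><body></body></html>';
$base = 'https://SOME_URL/';

// The question's pattern expects `">` right after the value, but the source has `" />`.
$matched = preg_match('/<base href="(.+)">/', $html);

// Hypothetical variant with a second attribute, to show the effect of greediness.
$htmlWithTarget = '<html><head><base href="/" target="_blank" /></head><body></body></html>';

// Greedy .+ runs to the last double quote on the line, so target="_blank" is lost.
$greedy = preg_replace('/(<base[ ]*?href=").+(")/i', '${1}' . $base . '${2}', $htmlWithTarget);

// [^"]* stops at the first closing quote, so only the href value is replaced.
$bounded = preg_replace('/(<base[ ]*?href=")[^"]*(")/i', '${1}' . $base . '${2}', $htmlWithTarget);

var_dump($matched);
echo $greedy, PHP_EOL, $bounded, PHP_EOL;
```

On the single-attribute string from the question both replacement patterns behave the same; the difference only shows up once another quoted attribute follows `href`.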
Chase

Try this pattern

(?<=<base\s)href="(.*?)"

Check out the demo

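  // JavaScript snippet
  const $html = '<html><head><base href="/" /></head><body></body></html>';
  const $base = 'https://SOME_URL/';
  // The match covers the whole href="..." attribute, so the replacement must include href= again
  const res = $html.replace(/(?<=<base\s)href="([^"]*)"/, `href="${$base}"`);
  console.log(res);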
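
The same lookbehind idea can be sketched in PHP, since PCRE accepts a fixed-length lookbehind such as `(?<=<base\s)`. This is only a sketch that assumes the single-line input from the question; the variable names are borrowed from the question's code.

```php
<?php
$html = '<html><head><base href="/" /></head><body></body></html>';
$base = 'https://SOME_URL/';

// The lookbehind asserts that "<base " precedes href but keeps it out of the match,
// so the replacement only has to rebuild the href attribute itself.
$output = preg_replace('/(?<=<base\s)href="[^"]*"/i', 'href="' . $base . '"', $html);

print $output;
```

If `$base` could ever contain `$` or `\` followed by digits, `preg_replace_callback` would avoid having those read as backreferences.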
Sven.hig
  • `[` in the line, see https://regex101.com/r/GKxaIZ/2 – Toto Jul 25 '20 at 10:23
  • @Toto I have fixed it. Do you think it needs any more adjustments? – Sven.hig Jul 25 '20 at 10:49
  • Why are you using a lookbehind? It's useless here. And why a character class `[ – Toto Jul 25 '20 at 12:01
  • OP wants to change base URLs! – Sven.hig Jul 25 '20 at 12:04
  • Well, but, how do you change it? Where is the code? And you didn't answer my previous questions: Why are you using a lookbehind? Why a character class `[ – Toto Jul 25 '20 at 12:17
  • Do you know of a method to match a link in a base tag without using something like a lookbehind? – Sven.hig Jul 25 '20 at 12:28
  • Look at the other answer! – Toto Jul 25 '20 at 12:30
  • I have looked at the other answer; I just don't see any reason to match as well. Maybe you can enlighten me as to why you think it should be matched. Since PHP is not supported here to run the code, I have made a snippet in `js` replacing the base URL. Do you think the same can't be achieved in PHP? – Sven.hig Jul 25 '20 at 12:50
  • And personally, I think a DOM parser is the best solution to this question. – Sven.hig Jul 25 '20 at 12:53
  • That's better, you've completely changed the regex. I agree with you about the parser; that's why I've closed the question. – Toto Jul 25 '20 at 12:57
  • Thank you for the feedback, it was constructive :) – Sven.hig Jul 25 '20 at 13:00
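
For the DOM-parser route that the comments converge on, a minimal sketch using PHP's bundled `DOMDocument` could look like the following. It assumes the `dom` extension is available and that re-serialised markup is acceptable: `saveHTML()` normalises the document, so the output may not be byte-for-byte identical to the input (for example, a doctype may be added and the self-closing slash dropped).

```php
<?php
$html = '<html><head><base href="/" /></head><body></body></html>';
$base = 'https://SOME_URL/';

$doc = new DOMDocument();
// Collect libxml warnings internally instead of emitting them for unusual markup.
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

// Update every <base> element; a valid document should contain at most one.
foreach ($doc->getElementsByTagName('base') as $element) {
    $element->setAttribute('href', $base);
}

$output = $doc->saveHTML();
print $output;
```

Unlike the regex approaches, this does not depend on attribute order, spacing, or whether the tag is written as `<base ...>` or `<base ... />`.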