0

I have these URLs:

$output = "href=\"/one/two/three\"
href=\"one/two/three\"
src=\"windows.jpg\"
action=\"http://www.google.com/docs\"";

When I apply the regular expression:

$base_url_page = "http://mainserver/";
$output = preg_replace( "/(href|src|action)(\s*)=(\s*)(\"|\')(\/+|\/*)(.*)(\"|\')/ismU", "$1=\"" . $base_url_page . "$6\"", $output );

I get this:

$output = "href=\"http://mainserver/one/two/three\"
href=\"http://mainserver/one/two/three\"
src=\"http://mainserver/windows.jpg\"
action=\"http://mainserver/http://www.google.com/docs\"";

How can I modify the regular expression so that absolute URLs are left alone, instead of producing http://mainserver/http://www.google.com/docs?

Alan Moore
CRISHK Corporation
  • Some advice on HTML parsing with regex: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – Marc B Nov 16 '10 at 02:18
  • Would it suffice to just change the base URI with the [`BASE` element](http://www.w3.org/TR/html4/struct/links.html#edef-BASE)? – Gumbo Nov 16 '10 at 17:18

2 Answers

1

Try

$output = preg_replace( "/(href|src|action)\s*=\s*[\"'](?!http)\/*([^\"']*)[\"']/ismU", "$1=\"" . $base_url_page . "$2\"", $output );

I have simplified your regex and added a lookahead that makes sure the string you're matching doesn't start with http. As it is now, this regex allows neither single nor double quotes inside the URL.
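
To see what the lookahead does in isolation, here is a minimal sketch. The helper name `is_skipped_by_lookahead` is hypothetical and not part of the answer; it only tests whether a value starts with "http", which is the condition `(?!http)` checks right after the opening quote. It also suggests a caveat: a relative path that happens to begin with "http" would be skipped too.

```php
<?php
// Hypothetical helper (an illustration, not from the answer): returns true
// when the (?!http) lookahead would reject the value, i.e. leave it untouched.
function is_skipped_by_lookahead(string $url): bool
{
    // ^(?!http) matches only if the value does NOT start with "http",
    // so a result of 0 (no match) means the value is skipped.
    return preg_match('/^(?!http)/i', $url) === 0;
}

var_dump(is_skipped_by_lookahead('http://www.google.com/docs')); // bool(true)
var_dump(is_skipped_by_lookahead('one/two/three'));              // bool(false)
var_dump(is_skipped_by_lookahead('httpdocs/index.html'));        // bool(true)
```

If relative paths such as `httpdocs/index.html` matter, a stricter lookahead like `(?!https?:\/\/)` might be a better fit, though that is an assumption about the input rather than something the question states.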

Tim Pietzcker
  • This won’t allow attribute values containing a plain `'` like `href="http://example.com/foo'bar"` (and this *is* a valid URI!). – Gumbo Nov 16 '10 at 17:20
  • I know, that's why I wrote so in my answer. If this is a problem for the OP, the regex can be changed. – Tim Pietzcker Nov 16 '10 at 20:09
  • This solution is great, thanks. The only problem is that with `href="/url"` the result is `href="http://mainserver//url"`, with a double slash. (In my regex I resolved this using `(\/+|\/*)`.) – CRISHK Corporation Nov 17 '10 at 04:16
  • I'm not sure I follow - a URL like `href="/url/foo"` will be transformed into `http://mainserver/url/foo` in my tests. Is this wrong? Could you perhaps edit your question with an example that's currently not working as expected? – Tim Pietzcker Nov 17 '10 at 07:22
0
$output = preg_replace( "/(href|src|action)\s*=\s*[\"'](?!http)(\/+|\/*)([^\"']*)[\"']/ismU", "$1=\"" . $base_url_page . "$3\"", $output );
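
As a quick check, the following sketch runs this regex against the sample input from the question. The variable `$result` and the use of `implode` to build the input are choices made for illustration; the pattern, the replacement and `$base_url_page` are the ones shown above.

```php
<?php
// Base URL and sample attributes taken from the question.
$base_url_page = "http://mainserver/";
$output = implode("\n", [
    'href="/one/two/three"',
    'href="one/two/three"',
    'src="windows.jpg"',
    'action="http://www.google.com/docs"',
]);

// (?!http) leaves absolute URLs alone; (\/+|\/*) consumes a leading slash
// so that it is not duplicated after the base URL's trailing slash.
$result = preg_replace(
    "/(href|src|action)\s*=\s*[\"'](?!http)(\/+|\/*)([^\"']*)[\"']/ismU",
    "$1=\"" . $base_url_page . "$3\"",
    $output
);

echo $result, "\n";
// href="http://mainserver/one/two/three"
// href="http://mainserver/one/two/three"
// src="http://mainserver/windows.jpg"
// action="http://www.google.com/docs"
```

Note that the replacement always emits double quotes, so an attribute originally written with single quotes comes out double-quoted.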
CRISHK Corporation