1

I have some basic HTML that I am running str_replace() on. I need to prepend 'generate_book.php?link=' to every URL found in the HTML string, but exclude any external links, e.g.:

<a href="gst/3.html">Link</a> -- this should become -- <a href="generate_book.php?link=gst/3.html">Link</a>

<a href="http://example.com">Link</a> -- this should be left alone

Your brain powa is appreciated!

danjah

2 Answers

1

You'll want to use a negative look-ahead at the start of the URL to make sure it does not begin with http:// or https://. You could also add mailto: if you are worried about it.

$str = preg_replace("/(?<=href=\")(?!http:\/\/|https:\/\/)([^\"]+)/i", "generate_book.php?link=$1", $str);

This regex also uses a look-behind (the (?<=href=\") part), so href=" has to precede the URL but is not part of the match and does not need to be repeated in the replacement.
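As a quick check, here is a minimal sketch that runs the same pattern over the two sample links from the question. The variable names and the use of # as the regex delimiter (so the slashes need no escaping) are choices made for this example, not part of the original answer.

```php
<?php
// Sample input taken from the question: one relative link, one external link.
$html = '<a href="gst/3.html">Link</a> <a href="http://example.com">Link</a>';

// Same idea as the pattern above: look-behind for href=", negative look-ahead
// for http:// or https://, then capture everything up to the closing quote.
$pattern = '#(?<=href=")(?!https?://)([^"]+)#i';

$result = preg_replace($pattern, 'generate_book.php?link=$1', $html);

echo $result, "\n";
// <a href="generate_book.php?link=gst/3.html">Link</a> <a href="http://example.com">Link</a>
```

Only the relative link is rewritten; the external one fails the negative look-ahead and is left alone.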

Warnings:

  • Be aware of which URL schemes other than HTTP and HTTPS appear in the HTML, if any.
  • Some tags like the link tag also have an href attribute. Make sure you aren't replacing these. If you need to match only A tags by using Regex, your regex complexity will grow considerably and still won't really be safe.
  • Regex eval (the e modifier) is much less efficient and is unsafe, but if you need URL encoding, you can attempt to URL-encode at replace time, as the second return in the other answer does.
  • Overall, Regex is not necessarily the best solution for this. You might be better off with an HTML parser...
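For the HTML-parser route mentioned in the last warning, a rough sketch using PHP's bundled DOMDocument might look like the following. The function name prefix_internal_links and its $prefix parameter are hypothetical, and the rule for what counts as "external" (any scheme such as http: or mailto:, or a protocol-relative //) is an assumption you may want to adjust. Only a tags are touched, so link tags keep their href.

```php
<?php
// Hypothetical helper, not from the original answer.
function prefix_internal_links($html, $prefix = 'generate_book.php?link=')
{
    $doc = new DOMDocument();

    // Collect libxml warnings about imperfect markup instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach ($doc->getElementsByTagName('a') as $a) {
        $href = $a->getAttribute('href');

        // Skip missing hrefs, in-page anchors, URLs with a scheme
        // (http:, https:, mailto:, ...) and protocol-relative URLs.
        if ($href === '' || $href[0] === '#'
            || preg_match('#^([a-z][a-z0-9+.-]*:|//)#i', $href)) {
            continue;
        }

        // URL-encode the original target so it survives as a query value.
        $a->setAttribute('href', $prefix . urlencode($href));
    }

    // Note: saveHTML() returns a full document (doctype, html, body wrappers).
    return $doc->saveHTML();
}

echo prefix_internal_links(
    '<a href="gst/3.html">Link</a> <a href="http://example.com">Link</a>'
);
```

Because the target is URL-encoded, gst/3.html ends up as gst%2F3.html in the query string; PHP decodes it again when generate_book.php reads $_GET['link'].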
Nicole
  • Ok, I've tried all suggestions on this page, but they're all generating errors. Yours seems to generate the least with: Warning: preg_replace() [function.preg-replace]: Unknown modifier '/' in C:\wamp\www\projects\kineo\taxteam\[CD_COURSE] GCSB (Central)\htdocs\book\generate_book.php on line 64 – danjah Nov 30 '10 at 04:53
  • Oops, replace both `//` after the http/https with `\/\/`. I wasn't using PHP to test the regex, so I didn't have to escape them. – Nicole Nov 30 '10 at 05:20
  • That works very nicely for me, thanks. I didn't want to scare anyone off with the context in which I'm using it, but it's for a Server2Go package burnt to CD-ROM, so there's very little danger involved, just broken links :) – danjah Nov 30 '10 at 20:51
0

Give this a try:

$str = preg_replace(
    "(href=\"([^\"]+)\")ie",
    "if(substr('$1',0,7) == 'http://')
        return stripslashes('$1');
     else
        return 'generate_book.php?link='.urlencode(stripslashes('$1'));",
    $str);
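The comments below ask for a preg_replace_callback() version, so here is a minimal sketch of one. The callback name rewrite_href and the sample input are made up for illustration. It avoids the e modifier, which was deprecated in PHP 5.5 and removed in PHP 7.0. The e modifier also appears to evaluate the replacement as a single expression, which would explain why the if/else statement above fails with "Failed evaluating code".

```php
<?php
// Hypothetical callback: $m[0] is the whole href="..." text, $m[1] the URL.
function rewrite_href($m)
{
    $url = $m[1];

    // Leave absolute http/https links exactly as they were.
    if (preg_match('#^https?://#i', $url)) {
        return $m[0];
    }

    // Put the href="..." wrapper back, since it is part of the match.
    return 'href="generate_book.php?link=' . urlencode($url) . '"';
}

$html = '<a href="47.html">Link</a> <a href="http://example.com">Link</a>';
$result = preg_replace_callback('#href="([^"]+)"#i', 'rewrite_href', $html);

echo $result, "\n";
// <a href="generate_book.php?link=47.html">Link</a> <a href="http://example.com">Link</a>
```

A named function is used so the sketch also runs on PHP versions before 5.3; on 5.3 and later an anonymous function could be passed instead.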
Niet the Dark Absol
  • I'm pretty sure you can't just pass a function as a string there. What you want is `preg_replace_callback()` and `create_function()` (or use an anonymous function if using PHP >= 5.3). – alex Nov 30 '10 at 01:48
  • The `e` modifier makes the `replace` parameter be evaluated as PHP code [PHP.net PCRE modifiers](http://uk3.php.net/manual/en/reference.pcre.pattern.modifiers.php) – Niet the Dark Absol Nov 30 '10 at 01:52
  • Sorry to be a n00b, @alex could you please provide me an example code of what you mean? I think I follow, but I'm also a complete novice :) – danjah Nov 30 '10 at 01:53
  • @Kolink Oh missed that. I've always avoided it, because it always looked like a security issue to me. – alex Nov 30 '10 at 02:39
  • If you change the regex from `[^\"]` to check for valid URL characters, there won't be a security risk because the characters are harmless - and it's escaped in a string anyway. – Niet the Dark Absol Nov 30 '10 at 02:45
  • I'd still be curious to know how your suggestion works Kolink, if you want to n00b it down for me! – danjah Nov 30 '10 at 20:53
  • Using the above regex, which seems to be more awesome than I first thought because of the string method call, currently fails for me: "Fatal error: preg_replace() [function.preg-replace]: Failed evaluating code: if(substr('47.html',0,7) == 'http://') return stripslashes('47.html'); else return 'generate_book.php?link='.urlencode(stripslashes('47.html'));" – danjah Dec 01 '10 at 02:59
  • That's a new one on me... o_O I mean, I've had some weird PHP errors before, but that one's the worst :D After all, the code is valid... – Niet the Dark Absol Dec 01 '10 at 03:04
  • I thought I got it to work just now, I had a guess thanks to your comment re: 'e' - so I passed in 'e' as a 3rd arg to preg_replace(). I now get no errors, but it doesn't look like the preg_replace() is doing anything at all, certainly neither of the stringified conditional returns are returning their values (I'm just using a test string 'replaced_url' to see if it picks up... nope) – danjah Dec 01 '10 at 03:38
  • Ok, so I removed the 'e' and replaced with -1, as php.net says the 3rd arg is for 'count'. So my script doesn't complain, and replacement has happened - but the replacement URL is... the string value of that 2nd arg, $1 has dutifully been replaced by my original URL though, so: /htdocs/book/if(substr('../gst/01.html',0,7)%20==%20'http://'){return%20stripslashes('../gst/01.html');}else{return%20'generate_book.php?link='.%20sanitize_relative_url(%20urlencode(stripslashes('../gst/01.html'))%20);} – danjah Dec 01 '10 at 03:48
  • You need the e modifier on the regex. Otherwise it doesn't evaluate the code and just sticks it in as replacement. I'm just not sure why you're getting the error. I've never had any problems like that before. – Niet the Dark Absol Dec 01 '10 at 04:38
  • Hmm, I'm running WAMP2 with Apache 2.2.11, PHP 5.3.0, MySQL 5.1.36 - so on Windows. I switch between WAMP server via the program tray menu, and multiple installs of Moodle, from 1.9.5 through 1.9.9. I can only run a Moodle, or a WAMP, and I've never experienced any unsolvable weird errors, but then I'm a PHP hack, not a dedicated clansman. Maybe running all of this is causing instability? Though like I say, I haven't noticed anything obviously wrong. – danjah Dec 01 '10 at 08:45