
I have an HTML page in a string, and I need to replace all the spaces in the a href attributes with %20 so that my parser understands them.

So for example:

<a href="file with spaces.mp3">file with spaces.mp3</a>

needs to turn into

<a href="file%20with%20spaces.mp3">file with spaces.mp3</a>

One space works fine since I can just use

(.+?)([ *])(.+?)

and then substitute %20 between $1 and $3.

But how would you do it for an unknown number of spaces, while still keeping the parts of the file name to put the %20s between?

Bert

2 Answers


HTML is not a regular language and cannot be properly parsed using a regular expression. Use a DOM parser instead. Here's a solution using PHP's built-in DOMDocument class:

$dom = new DOMDocument;
$dom->loadHTML($html);

foreach ($dom->getElementsByTagName('a') as $tag) {
    $href = $tag->getAttribute('href');
    $href = str_replace(' ', '%20', $href);
    $tag->setAttribute('href', $href);
}

$html = $dom->saveHTML();

It iterates over all the links and replaces every space in each href attribute with %20 using str_replace.
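
To illustrate why str_replace is used here rather than a general URL-encoding function (the point raised in the comments on this answer), here is a minimal sketch comparing the two on a full URL. The URL is the illustrative one from that comment, and the variable names are only for this example.

```php
<?php
// Illustrative URL; any href with a scheme and a path behaves the same way.
$href = 'http://example.com/file with spaces.mp3';

// rawurlencode() percent-encodes reserved characters as well, so the
// colon and the slashes are encoded too, which breaks a full URL.
$encoded = rawurlencode($href);
echo $encoded, PHP_EOL;
// prints: http%3A%2F%2Fexample.com%2Ffile%20with%20spaces.mp3

// str_replace() touches only the spaces and leaves everything else intact.
$replaced = str_replace(' ', '%20', $href);
echo $replaced, PHP_EOL;
// prints: http://example.com/file%20with%20spaces.mp3
```

If other characters in the file name also need encoding, one option is to encode only the basename, as the comment suggests.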


Amal Murali
  • Only works for some examples: `echo rawurlencode('http://example.com/file with spaces.mp3');` Use `str_replace()` or maybe pull out the basename and encode that. – AbraCadaver Mar 27 '14 at 14:53
  • @AbraCadaver: Ah. You're right. I've updated the answer to use `str_replace()` instead. – Amal Murali Mar 27 '14 at 14:56
  • This is probably the best solution. The only problem is that I have to work with HTML-encoded strings (&lt; and &quot; for < and ", and so on), so DOMDocument can't read them in. – Bert Mar 31 '14 at 07:54
  • @Axon: Simply replace `$dom->loadHTML($html);` with `$dom->loadHTML(htmlspecialchars_decode($html));`. – Amal Murali Mar 31 '14 at 08:05
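
Putting the answer and the suggestion from the last comment together, a minimal sketch for entity-encoded input might look like this. The function name encodeHrefSpaces and the sample input are hypothetical, chosen only for illustration, and it assumes the whole string is entity-encoded HTML.

```php
<?php
// Hypothetical helper (the name is illustrative): decodes entity-encoded
// HTML, then replaces spaces in every href with %20, as in the answer.
function encodeHrefSpaces(string $encodedHtml): string
{
    $dom = new DOMDocument;
    // Turn &lt;, &gt;, &quot; and &amp; back into real markup first.
    $dom->loadHTML(htmlspecialchars_decode($encodedHtml));

    foreach ($dom->getElementsByTagName('a') as $tag) {
        $href = str_replace(' ', '%20', $tag->getAttribute('href'));
        $tag->setAttribute('href', $href);
    }

    // saveHTML() returns decoded markup and, for a fragment like this one,
    // wraps it in a doctype plus <html> and <body> elements.
    return $dom->saveHTML();
}

// Sample input: the question's link, entity-encoded.
$input = '&lt;a href=&quot;file with spaces.mp3&quot;&gt;file with spaces.mp3&lt;/a&gt;';
$output = encodeHrefSpaces($input);
echo $output;
```

The result is plain markup; if the rest of the pipeline expects entity-encoded text again, it would need to be re-encoded afterwards (for example with htmlspecialchars).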

While it's not recommended to use regex, here's a potential regex that works for your example:

(?:<a href="|\G)\S*\K (?=[^">]*")


(?:
  <a href="   # Match <a href=" literally
|
\G            # Or start the match from the previous end-match
)
\S*           # Match any non-space characters
\K            # Reset the match start so only what follows (the space) is replaced
 (?=[^">]*")  # Match a space, and ensure it is still inside the href value
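
As a minimal sketch of how this pattern could be applied in PHP, here it is with preg_replace on the sample link from the question. The ~ delimiters and the variable names are choices made for this example only.

```php
<?php
// The sample markup from the question.
$html = '<a href="file with spaces.mp3">file with spaces.mp3</a>';

// The pattern from this answer, wrapped in ~ delimiters.
$pattern = '~(?:<a href="|\G)\S*\K (?=[^">]*")~';

// Because of \K, each match is a single space inside the href value,
// so only that space is replaced.
$result = preg_replace($pattern, '%20', $html);

echo $result, PHP_EOL;
// prints: <a href="file%20with%20spaces.mp3">file with spaces.mp3</a>
```

The spaces in the link text are left alone because the lookahead hits a > before it finds a closing quote.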

The above regex can still break on certain edge cases, so I recommend using DOMDocument, as in Amal's excellent answer, which is more robust.

Jerry
  • Your result works most of the time, but for some matches it just doesn't seem to work. I can't find the problem... http://regex101.com/r/zE2zA1 – Bert Mar 31 '14 at 07:53
  • @Axon That's because a character class doesn't keep the order of characters. Using a negative lookahead would work better here: [link](http://regex101.com/r/jB3aM8). – Jerry Mar 31 '14 at 07:56
  • Dayumn, that's one complex regex. But it seems to work fine, thanks for your help man. – Bert Mar 31 '14 at 07:59
  • @Axon Sorry about the complexity ^^; It took me a while to be able to understand such regexes too, so it's natural to feel that way. Glad to have helped :) – Jerry Mar 31 '14 at 08:01