regex - replace spaces IN stored element

Question

I have an HTML page in a string, and I need to replace all the spaces in the `<a href>` references with %20 so that my parser understands them.

So for example:

<a href="file with spaces.mp3">file with spaces.mp3</a>

needs to turn into

<a href="file%20with%20spaces.mp3">file with spaces.mp3</a>

One space works fine since I can just use

(.+?)([ *])(.+?)

and then substitute it with %20 in between $1 and $3.

But how would you do it for an unknown number of spaces, while still keeping the parts of the file name to put the %20s in between?

Do you want to replace all spaces or only the ones in href? What about src, for example? — Dexa, Mar 27 '14 at 14:50
Here we go again: [Don't parse (X)HTML with regex!](http://stackoverflow.com/a/1732454/418066) And you probably want a proper URL encoder. — Biffen, Mar 27 '14 at 14:50
It's just to replace spaces, I don't see how it would kill my HTML — Bert, Mar 27 '14 at 14:50
[DOMDocument](http://stackoverflow.com/questions/5278418/using-domdocument-and-parsing-info-i-would-like-to-get-the-href-contents-of) then run `str_replace()`. — AbraCadaver, Mar 27 '14 at 14:51
Also @Dexa, I only need the references in `<a>`; it has to do with the parser that I'm writing. The src (for img tags) already works without replacing the spaces — Bert, Mar 27 '14 at 14:51

Amal Murali · Answer 1 · 2014-03-27T14:55:50.990

4

HTML is not a regular language and cannot be properly parsed using a regular expression. Use a DOM parser instead. Here's a solution using PHP's built-in DOMDocument class:

$dom = new DOMDocument;
$dom->loadHTML($html);

foreach ($dom->getElementsByTagName('a') as $tag) {
    $href = $tag->getAttribute('href');
    $href = str_replace(' ', '%20', $href);
    $tag->setAttribute('href', $href);
}

$html = $dom->saveHTML();

It basically iterates over all the links and changes the href attribute using str_replace.
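For reference, here is a minimal self-contained sketch of the same idea wrapped in a function. The function name `encodeHrefSpaces` and the sample markup are made up for illustration, and the `libxml_use_internal_errors()` calls are an assumption to keep parser warnings quiet on imperfect markup. Note that `saveHTML()` returns a whole document, so the output includes the doctype and the `html`/`body` wrappers.

```php
<?php
// Hypothetical helper (name chosen for this sketch): replaces spaces with
// %20 in the href attribute of every <a> element in an HTML string.
function encodeHrefSpaces(string $html): string
{
    $dom = new DOMDocument();

    // Collect parser warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach ($dom->getElementsByTagName('a') as $tag) {
        if ($tag->hasAttribute('href')) {
            $href = str_replace(' ', '%20', $tag->getAttribute('href'));
            $tag->setAttribute('href', $href);
        }
    }

    // Returns the full serialized document, not just the original fragment.
    return $dom->saveHTML();
}

$result = encodeHrefSpaces('<p><a href="file with spaces.mp3">file with spaces.mp3</a></p>');
echo $result;
```

Only the attribute value is touched; the link text between the tags keeps its spaces.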


edited Mar 27 '14 at 14:55

answered Mar 27 '14 at 14:50

Amal Murali


Only works for some examples: `echo rawurlencode('http://example.com/file with spaces.mp3');` Use `str_replace()` or maybe pull out the basename and encode that. – AbraCadaver Mar 27 '14 at 14:53
@AbraCadaver: Ah. You're right. I've updated the answer to use `str_replace()` instead. – Amal Murali Mar 27 '14 at 14:56
This is probably the best result; the only problem is that I have to work with HTML-encoded strings, so `&lt;` and `&quot;` for < and " and such, so DOMDocument can't read them in. – Bert Mar 31 '14 at 07:54
@Axon: Simply replace `$dom->loadHTML($html);` with `$dom->loadHTML(htmlspecialchars_decode($html));`. – Amal Murali Mar 31 '14 at 08:05
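
As a rough sketch of the suggestion in the last comment, assuming the input is entity-encoded as described (the sample string and variable names here are made up for illustration): decode first, then let DOMDocument do the rest.

```php
<?php
// Assumed sample input: the question's link, HTML-encoded.
$encoded = '&lt;a href=&quot;file with spaces.mp3&quot;&gt;file with spaces.mp3&lt;/a&gt;';

// Turn &lt;, &gt;, &quot; and &amp; back into real markup.
$decoded = htmlspecialchars_decode($encoded);

$dom = new DOMDocument();
$dom->loadHTML($decoded);

// Same loop as in the answer: fix spaces in each link's href.
foreach ($dom->getElementsByTagName('a') as $tag) {
    $href = str_replace(' ', '%20', $tag->getAttribute('href'));
    $tag->setAttribute('href', $href);
}

// Full serialized document (doctype, html and body wrappers included).
$fixed = $dom->saveHTML();
echo $fixed;
```

If the rest of the pipeline expects the encoded form again, the result would have to be re-encoded afterwards; whether that is needed depends on the parser, which the thread does not show.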

score 0 · Accepted Answer · answered Mar 27 '14 at 15:25

0

While it's not recommended to use regex, here's a potential regex that works for your example:

(?:<a href="|\G)\S*\K (?=[^">]*")


(?:
  <a href="   # Match <a href=" literally
|
\G            # Or start the match from the previous end-match
)
\S*           # Match any non-space characters
\K            # Reset the match so only the following matches are replaced
 (?=[^">]*")  # Ensure that the matching part is still within the href link

The above regex could also break on certain edge cases, so I recommend using DOMDocument, as in Amal's excellent answer, which is more robust.
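
To make this concrete, here is a small sketch of the pattern used with `preg_replace()`, plus one of the edge cases. The second input and the `\G(?!\A)` guard are not part of the answer above; the guard is a commonly used idiom added here as an assumption. It stops `\G` from anchoring at the very start of the string, where it would otherwise let the pattern touch text before the first link.

```php
<?php
// Pattern from the answer, wrapped in PCRE delimiters.
$pattern = '/(?:<a href="|\G)\S*\K (?=[^">]*")/';

// The question's example: every space inside the href becomes %20.
$html  = '<a href="file with spaces.mp3">file with spaces.mp3</a>';
$fixed = preg_replace($pattern, '%20', $html);
echo $fixed, "\n";

// Edge case (assumed input): \G also matches at offset 0, so a space
// before the first link that is followed by a quote gets replaced too.
$withPrefix = '<p> <a href="a b.mp3">a b</a></p>';
$broken     = preg_replace($pattern, '%20', $withPrefix);
echo $broken, "\n";

// Assumed guard, not in the original answer: \G(?!\A) only continues
// from a previous match, never from the start of the subject.
$guarded = '/(?:<a href="|\G(?!\A))\S*\K (?=[^">]*")/';
$ok      = preg_replace($guarded, '%20', $withPrefix);
echo $ok, "\n";
```

Even with the guard, markup such as extra attributes after the href can still trip the pattern, which is one more reason to prefer the DOM approach.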

answered Mar 27 '14 at 15:25

Jerry


Your result works most of the time, but for some matches it just doesn't seem to work. I can't find the problem... http://regex101.com/r/zE2zA1 – Bert Mar 31 '14 at 07:53
@Axon That's because a character class doesn't keep the order of characters. Using a negative lookahead would work better here: [link](http://regex101.com/r/jB3aM8). – Jerry Mar 31 '14 at 07:56
Dayumn, that's one complex regex. But it seems to work fine, thanks for your help man. – Bert Mar 31 '14 at 07:59
@Axon Sorry about the complexity ^^; It took me a while to be able to understand such regexes too, so it's natural to feel that way. Glad to have helped :) – Jerry Mar 31 '14 at 08:01
