Replace every character with an element

Question

This is what I have

$str = 'Just a <span class="green">little</span> -text åäö width 123#';

This is what I need

Results in spans and spaces, might be newlines as well.

$result = '<span></span><span></span><span></span><span></span> <span></span> <span class="green"><span></span><span></span><span></span><span></span><span></span><span></span></span> <span></span><span></span><span></span><span></span><span></span> <span></span><span></span><span></span> <span></span><span></span><span></span><span></span><span></span> <span></span><span></span><span></span>';

You might wonder what I could possibly need this for. I want to build a thing where every character is represented by a block. It will look a bit like Defrag on Windows XP.

Question

Replace every character with <span></span>.
Do not touch the HTML span that already exists in the string (might be hard?). There can be more than one HTML element.
Do not touch spaces and newlines.
Should a regexp do it, or XPath?

What have I done so far?

I have found articles about regexps, but none about replacing every character (except space and newline).

$result = preg_replace("/???/", "<span></span>", $str);
print_r($result);

try `preg_replace("/[^[:space:]\n]/", "<span></span>", $str);` [] is a set of characters, ^ is NOT, [:space:] or \s is whitespace, \n is newline — Waygood, Apr 30 '13 at 10:13
The "don't touch the HTML that already exists in the string" part is where regex solutions cause problems. You really want to use a DOM parser, to iterate only over the text nodes and apply a `/\S/` -> `<span></span>` replacement on those. [Here is a good overview of your DOM-parsing options](http://stackoverflow.com/questions/3577641/how-to-parse-and-process-html-xml) — Martin Ender, Apr 30 '13 at 10:36
There can be more than one. I updated my question information. — Jens Törnell, Apr 30 '13 at 10:49

score 2 · Answer 1 · answered Apr 30 '13 at 11:08

You can use preg_replace_callback()

$str = 'Just a <span class="green">little</span> -text åäö width 123#';

function replacement($matches) {
    if (strlen($matches[0]) == 1) {
        return "<span></span>";
    } else {
        return $matches[0];
    }
}

$result = preg_replace_callback("~<span.*?<\s*/\s*span>|\S~", "replacement", $str);
print_r($result);

This just computes the replacement string depending on the match. If the length of the match is 1 (a non-whitespace character has been found), it is replaced with the "span" tags; otherwise a span element has been matched and is reinserted unchanged.
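As a hedged variation on the same callback idea, the sketch below decides by capture group instead of by match length, so it can run with the `u` flag without multi-byte characters slipping through the `strlen() == 1` check. It differs from the pattern above in one respect that should be stated plainly: it matches individual tags (`<[^>]*>`) rather than whole span elements, so the text inside an existing span is converted as well, which is what the expected result in the question shows. The function name `spanify_callback` is made up for illustration, and the input is assumed to be valid UTF-8.

```php
<?php
// Hypothetical helper name; assumes $str is valid UTF-8.
function spanify_callback(string $str): string
{
    $result = preg_replace_callback(
        '~(<[^>]*>)|\S~u',
        function (array $m): string {
            // Group 1 is only set when a tag was matched: keep the tag as-is.
            if (isset($m[1]) && $m[1] !== '') {
                return $m[0];
            }
            // Otherwise a single non-whitespace character was matched.
            return '<span></span>';
        },
        $str
    );

    // preg_replace_callback() returns null on error (e.g. invalid UTF-8);
    // fall back to the untouched input in that case.
    return $result ?? $str;
}

echo spanify_callback('Just a <span class="green">little</span> -text åäö width 123#');
```

Whitespace never matches either alternative, so spaces, tabs and newlines pass through unchanged.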

answered Apr 30 '13 at 11:08

stema

@Waygood, no because `\S` is a non-whitespace character, newlines belong to the whitespace characters, they are not matched. – stema Apr 30 '13 at 11:14
Does anything else belong to 'whitespace characters' too? If it's not just a space and newline, e.g. tab \t, then the results will be wrong? – Waygood Apr 30 '13 at 11:17
@Waygood of course a tab is also a whitespace character, since it prints only whitespace. If this is a problem, a negated character class should be used: `~<span.*?<\s*/\s*span>|[^ \r\n]~`. This would match every character that is not a space or a newline. – stema Apr 30 '13 at 11:22

Ruslan Polutsygan · Answer 2 · 2013-04-30T11:05:40.623

Is it a requirement to use only one regular expression?

If not, you could replace the substring you need to keep with some unique character, run the regexp replacement, and then put the substring back in place of that unique character.

Just like this:

$str2 = str_replace('<span class="green">little</span>', '$', $str);
$str3 = preg_replace("/([^\s\n\$])/", "<span></span>", $str2);
$result = str_replace('$', '<span class="green">little</span>', $str3);

see live demo http://codepad.viper-7.com/7wu9fd

UPD:

Perhaps this should be taken just as a hint. My suggestion is to store the substring(s) that need to be preserved, replace everything else as needed, and then put the stored values back into the string.

$str = 'Just a <span class="green">little</span> -text åäö width 123#';

preg_match_all('/<[^>]+>/', $str, $matches);
$storage=array();
for($i=0, $n=count($matches[0]); $i<$n; $i++)
{
    $key=str_repeat('$', $i+1);
    $value=$matches[0][$i];
    $storage[$key]=$value;
    $str=str_replace($value, $key, $str);
}
$storage=array_reverse($storage);

$str = preg_replace("/([^\s\n\$])/", "<span></span>", $str);
foreach($storage as $k=>$v)
{
    $str=str_replace($k, $v, $str);
}
echo htmlspecialchars($str);

A working demo is here: http://codepad.viper-7.com/L4YZOz
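The same store-and-restore idea can also be sketched without placeholder characters, which avoids trouble when the text itself contains a `$`. This is only an illustration under stated assumptions: the helper name `spanify_split` is invented here, the input is assumed to be valid UTF-8, and tags are assumed to be well-formed `<...>` runs. `preg_split()` with `PREG_SPLIT_DELIM_CAPTURE` keeps the tags in the result, so text chunks sit at even indexes and tags at odd indexes.

```php
<?php
// Hypothetical helper name; assumes valid UTF-8 and well-formed <...> tags.
function spanify_split(string $html): string
{
    // Split on tags but keep them: [text, tag, text, tag, ..., text]
    $parts = preg_split('/(<[^>]*>)/u', $html, -1, PREG_SPLIT_DELIM_CAPTURE);
    if ($parts === false) {
        return $html; // regex error (e.g. invalid UTF-8): leave input untouched
    }

    foreach ($parts as $i => $part) {
        if ($i % 2 === 0) {
            // Even index: plain text, replace every non-whitespace character.
            $parts[$i] = preg_replace('/\S/u', '<span></span>', $part);
        }
        // Odd index: a stored tag, put back unchanged.
    }

    return implode('', $parts);
}

echo spanify_split('Just a <span class="green">little</span> -text åäö width 123#');
```

Because the tags are never replaced by a marker, adjacent tags and literal `$` or `#` characters in the text need no special handling.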

Interesting solution. Too bad it's not an option for me. The 'little' can be anything and should be converted to spans as well. — Jens Törnell, Apr 30 '13 at 10:46

alexn · Accepted Answer · 2013-05-01T09:46:13.473

There is no need for hacky regex-solutions. A simple for loop with a state machine should do just fine:

define('STATE_READING', 1);
define('STATE_TAG', 2);

$str = 'Just a <span class="green">little</span> -text åäö width 123#';
$result = '';

$state = STATE_READING;
for($i = 0, $len = strlen($str); $i < $len; $i++) {
    $chr = $str[$i];

    if($chr == '<') {
        $state = STATE_TAG;
        $result .= $chr;
    } else if($chr == '>') {
        $state = STATE_READING;
        $result .= $chr;
    } else if($state == STATE_TAG || strlen(trim($chr)) === 0) {
        $result .= $chr;
    } else {
        $result .= '<span></span>';
    }
}

This loop only keeps track of whether we are inside a tag. Characters inside a tag, the angle brackets themselves, and whitespace are appended unchanged; every other character is replaced with <span></span>.
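The loop above walks the string byte by byte, so a multi-byte UTF-8 character such as å contributes one span per byte. A possible character-wise variant is sketched below; it is not part of the original answer. It assumes the input is valid UTF-8 and uses `preg_split('//u', ...)` to obtain one array entry per code point. The function name `spanify_state_machine` is hypothetical.

```php
<?php
// Hypothetical helper name; assumes $str is valid UTF-8.
function spanify_state_machine(string $str): string
{
    // One array entry per Unicode code point instead of per byte.
    $chars = preg_split('//u', $str, -1, PREG_SPLIT_NO_EMPTY);
    if ($chars === false) {
        return $str; // invalid UTF-8: leave input untouched
    }

    $inTag = false;
    $result = '';
    foreach ($chars as $chr) {
        if ($chr === '<') {
            $inTag = true;            // entering a tag
            $result .= $chr;
        } elseif ($chr === '>') {
            $inTag = false;           // leaving a tag
            $result .= $chr;
        } elseif ($inTag || trim($chr) === '') {
            $result .= $chr;          // tag content or whitespace: keep
        } else {
            $result .= '<span></span>';
        }
    }

    return $result;
}

echo spanify_state_machine('Just a <span class="green">little</span> -text åäö width 123#');
```

The state handling is the same as in the byte-wise loop; only the unit of iteration changes.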

Results in:

<span></span><span></span><span></span><span></span> <span></span> <span class="green"><span></span><span></span><span></span><span></span><span></span><span></span></span> <span></span><span></span><span></span><span></span><span></span> <span></span><span></span><span></span><span></span><span></span><span></span> <span></span><span></span><span></span><span></span><span></span> <span></span><span></span><span></span><span></span>

I prefer it over magical numbers. `$state == STATE_TAG` shows the intent better than `$state == 2` or `$state == 'x'`. — alexn, May 02 '13 at 11:58

score 0 · Answer 4 · answered Apr 30 '13 at 10:38

This is probably possible with a regex, but I'd go with a loop. The example code below is for single-byte character sets but can be modified for a multi-byte (e.g. UTF-16) or variable-byte (e.g. UTF-8) character set.

$input = 'Just a <span class="green">little</span> -text åäö width 123#';
$output = '';
$length = strlen($input);
$i = 0;
$matches = array(); // preg_match variable
// While for finer control
while($i < $length) {
    // Check for start of span tag, check for < character first for speed-up
    if($input[$i] == "<" && preg_match("#<span[^>]*>.*</span>#siU", substr($input, $i), $matches) == 1) {
        // Skip the span tag
        $i = $i + strlen($matches[0]);
        $output .= $matches[0];
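    } elseif (trim($input[$i]) === '') {
        // Leave whitespace (space, tab, newline) untouched
        $output .= $input[$i];
        $i++;
    } else {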
        $output .= "<span></span>";
        $i++;
    }
}

Haven't tested the code very well, might be some boundary conditions left, but the idea should be clear. — dtech, Apr 30 '13 at 10:39

Waygood · Answer 5 · 2013-04-30T11:09:18.370

0

Bit of a hack but try this:

$str="Just a <span class=\"green\">little</span> -text åäö\n width 123#";

// get all span tags
if(preg_match_all("/(\<span.*\<\/span\>)/", $str, $matches))
{
    // replace spans with #
    $str=preg_replace("/(\<span.*\<\/span\>)/", "#", $str);

    //print_r($matches);
}
// replace all non spaces, CR and #
$str=preg_replace("/[^\s\n#]/", "<span></span>", $str);
// replenish the matched spans
foreach($matches[0] as $value)
{
    $str=preg_replace('/#/', $value, $str, 1);
}

edited Apr 30 '13 at 11:09

answered Apr 30 '13 at 10:40

Waygood

Won't this break if `$str` contains a `#` in it somewhere between two span tags? – dtech Apr 30 '13 at 10:43
Yes, if there is a # outside of a set, that's why it's a hack – Waygood Apr 30 '13 at 11:08

score 0 · Answer 6 · answered Apr 30 '13 at 11:06

So here's what I came up with using preg_replace_callback():

$str = 'Just a <span class="green">little</span>-text åäö width 123#<span>aaa</span> lol';

// This requires PHP 5.3+
$output = preg_replace_callback('#.*?(<span[^>]*>.*?</span>)|.*#is', function($m){
    if(!isset($m[1])){return preg_replace('/\S/', '<span></span>', $m[0]);}
    $array = explode($m[1], $m[0]);
    $array = preg_replace('/\S/', '<span></span>', $array);
    return(implode($m[1], $array));
}, $str);
echo($output);

Output:

<span></span><span></span><span></span><span></span> <span></span> <span class="green">little</span><span></span><span></span><span></span><span></span><span></span> <span></span><span></span><span></span><span></span><span></span><span></span> <span></span><span></span><span></span><span></span><span></span> <span></span><span></span><span></span><span></span><span>aaa</span> <span></span><span></span><span></span>

score 0 · Answer 7 · answered Nov 15 '17 at 02:50

This is NOT a hacky regex method. This is a solid, concise, one-line-one-function-call solution that avoids having to iterate a battery of conditions on each character in a string, preserves tags, and cares for multi-byte characters.

alexn's solution does not maintain the visible character length of åäö. His solution will print 6 opening and closing span tags to screen instead of just 3. This is because mb_ functions are not used. On this topic, be wary of any methods on this page that are not using mb_ prefixed string functions.

My suggested solution will leverage the (*SKIP)(*FAIL) technique to ignore/disqualify all encountered tags and then only match non-white-space characters in the string.

Code: (Demo)

$str = 'Just a <span class="green">little</span> -text åäö width 123#';
var_export(preg_replace('/<[^>]*>(*SKIP)(*FAIL)|\S/','<span></span>',$str));  // no "u" flag means åäö will be span x6
echo "\n";
var_export(preg_replace('/<[^>]*>(*SKIP)(*FAIL)|\S/u','<span></span>',$str)); // "u" flag means åäö will be span x3

Output: (scroll right to see the impact of the unicode flag on the pattern)

'<span></span><span></span><span></span><span></span> <span></span> <span class="green"><span></span><span></span><span></span><span></span><span></span><span></span></span> <span></span><span></span><span></span><span></span><span></span> <span></span><span></span><span></span><span></span><span></span><span></span> <span></span><span></span><span></span><span></span><span></span> <span></span><span></span><span></span><span></span>'
// notice the number of replacements for åäö ->-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------111111111111122222222222223333333333333444444444444455555555555556666666666666
'<span></span><span></span><span></span><span></span> <span></span> <span class="green"><span></span><span></span><span></span><span></span><span></span><span></span></span> <span></span><span></span><span></span><span></span><span></span> <span></span><span></span><span></span> <span></span><span></span><span></span><span></span><span></span> <span></span><span></span><span></span><span></span>'
// notice the number of replacements for åäö ->-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------111111111111122222222222223333333333333
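Rather than counting spans by eye, the difference can be checked with the optional `$count` argument of `preg_replace()`. The snippet below is a small sketch of that check; `count_replacements` is a made-up helper name, and the script is assumed to be saved as UTF-8 so that each of å, ä and ö is two bytes long.

```php
<?php
// Hypothetical helper: how many characters would the pattern replace?
function count_replacements(string $str, bool $unicode): int
{
    // Same (*SKIP)(*FAIL) pattern as above, with or without the "u" flag.
    $pattern = '/<[^>]*>(*SKIP)(*FAIL)|\S/' . ($unicode ? 'u' : '');
    // The fifth argument receives the number of replacements performed.
    preg_replace($pattern, '<span></span>', $str, -1, $count);
    return (int) $count;
}

echo count_replacements('åäö', false), "\n"; // byte-wise matching
echo count_replacements('åäö', true), "\n";  // code-point matching
```

With UTF-8 input this should print 6 and then 3, matching the two outputs shown above.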

@JensTörnell How many sets of span tags did you want to see when replacing the multibyte characters? `åäö` should become 3 sets or 6 sets? It seems to me that you would only want three because there is no added benefit to six. — mickmackusa, Nov 19 '17 at 06:55
