Regex - Convert HTML to valid XML tag

Question

I need help writing a regex function that converts an HTML string to a valid XML tag name. It takes a string and does the following:

If a letter or underscore occurs in the string, it is kept.
If any other character occurs, it is removed from the output string.
If any other character occurs between words or letters, it is replaced with an underscore.

Ex:
Input: Date Created
Output: Date_Created

Input: Date<br/>Created
Output: Date_Created

Input: Date\nCreated
Output: Date_Created

Input: Date    1 2 3 Created
Output: Date_Created

Basically the regex function should convert the HTML string to a valid XML tag.

Your question says "I want to write", but it reads like a requirements list posted in the hope that someone will drop in the desired magic regex. It's also not clear what you consider XML tags; the output examples contain none. - mario, Jun 03 '12 at 04:20
What's wrong with that, if the situation only comes up once in a blue moon and the goal is just to add a ``quick and dirty patch-up`` to your test code in a hurry, using regex instead of DOM? - Cylian, Jun 03 '12 at 04:29

score 5 · Accepted Answer · answered Jun 03 '12 at 04:39

A bit of regex and a bit of standard functions:

function mystrip($s)
{
        // add spaces around angle brackets to separate tag-like parts
        // e.g. "<br />" becomes " <br /> "
        // then let strip_tags take care of removing html tags
        $s = strip_tags(str_replace(array('<', '>'), array(' <', '> '), $s));

        // any sequence of characters that are not letters or underscores
        // gets replaced by a single underscore
        return preg_replace('/[^a-z_]+/i', '_', $s);
}
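
As a quick check, the function above can be run against the four inputs from the question. This is only a sketch: the function body is copied from the answer so the snippet runs on its own, and the `$inputs` array and the final `trim` call are illustrative additions, not part of the original answer.

```php
<?php
// Copy of the answer's function so this snippet is self-contained.
function mystrip($s)
{
    // pad angle brackets with spaces, then let strip_tags remove the tags
    $s = strip_tags(str_replace(array('<', '>'), array(' <', '> '), $s));

    // collapse every run of non-letter, non-underscore characters into "_"
    return preg_replace('/[^a-z_]+/i', '_', $s);
}

// The four sample inputs from the question (illustrative variable name).
$inputs = array(
    "Date Created",
    "Date<br/>Created",
    "Date\nCreated",
    "Date    1 2 3 Created",
);

foreach ($inputs as $input) {
    echo mystrip($input), "\n"; // each line prints Date_Created
}

// Tags at the very start or end leave a leading or trailing underscore.
echo mystrip("<b>Date</b>"), "\n"; // prints _Date_

// Optional, and an assumption about what is wanted: trim those underscores.
// Note that this would also drop underscores that were in the input itself.
echo trim(mystrip("<b>Date</b>"), '_'), "\n"; // prints Date
```

Whether the leading and trailing underscores should be kept depends on the intended tag names; the question does not say, so the `trim` step is left as an option.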

score 2 · Answer 2 · answered Jun 03 '12 at 04:20

Try this:

$result = preg_replace('/([\d\s]|<[^<>]+>)/', '_', $subject);

Explanation

(                   # Match the regular expression below and capture its match into backreference number 1
                    # Match either the regular expression below (attempting the next alternative only if this one fails)
   [\d\s]           # Match a single character present in the list below
                    #    A single digit 0..9
                    #    A whitespace character (spaces, tabs, and line breaks)
 |                  # Or match regular expression number 2 below (the entire group fails if this one fails to match)
   <                # Match the character “<” literally
   [^<>]            # Match a single character NOT present in the list “<>”
      +             #    Between one and unlimited times, as many times as possible, giving back as needed (greedy)
   >                # Match the character “>” literally
)
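
One thing worth checking with this pattern is how it treats runs of digits and whitespace: each matched character is replaced on its own, so a run produces several underscores. The sketch below shows that, along with one possible variant (the `$collapsing` pattern is an assumption of mine, not part of the answer) that adds a `+` so a whole run becomes a single underscore.

```php
<?php
// Pattern from the answer: one digit, one whitespace character, or one tag.
$pattern = '/([\d\s]|<[^<>]+>)/';

echo preg_replace($pattern, '_', "Date<br/>Created"), "\n"; // prints Date_Created

// Every space and digit is replaced separately, so this prints "Date",
// then one underscore per matched character, then "Created".
echo preg_replace($pattern, '_', "Date    1 2 3 Created"), "\n";

// Hypothetical variant: repeat the group so a whole run collapses to one "_".
$collapsing = '/(?:[\d\s]|<[^<>]+>)+/';

echo preg_replace($collapsing, '_', "Date    1 2 3 Created"), "\n"; // prints Date_Created
```

Neither pattern touches punctuation outside of tags, so an input such as "Date-Created" would pass through unchanged.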

score 2 · Answer 3 · answered Jun 03 '12 at 04:20


You should be able to use:

$text = preg_replace( '/(?<=[a-zA-Z])[^a-zA-Z_]+(?=[a-zA-Z])/', '_', $text );

The lookarounds check that there is a letter immediately before and after the match, and the pattern replaces any run of non-letter, non-underscore characters between them.
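
A short sketch of how the lookarounds behave on a few inputs may help. The sample strings are illustrative, and the last line, which borrows the `strip_tags` idea from the accepted answer, is an assumption about how the two could be combined rather than something this answer proposes.

```php
<?php
// Pattern from the answer: a run of non-letters with a letter on each side.
$pattern = '/(?<=[a-zA-Z])[^a-zA-Z_]+(?=[a-zA-Z])/';

echo preg_replace($pattern, '_', "Date    1 2 3 Created"), "\n"; // prints Date_Created

// Leading and trailing characters have no letter on one side, so they stay.
echo '[', preg_replace($pattern, '_', " Date Created "), "]\n"; // prints [ Date_Created ]

// Tag names are letters, so they survive: "<" and "/>" are replaced separately.
echo preg_replace($pattern, '_', "Date<br/>Created"), "\n"; // prints Date_br_Created

// Possible combination (assumption): remove tags first, then apply the pattern.
$noTags = strip_tags(str_replace('<', ' <', "Date<br/>Created"));
echo preg_replace($pattern, '_', $noTags), "\n"; // prints Date_Created
```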

adomnom

score 1 · Answer 4 · answered Jun 03 '12 at 04:22

I believe the following should work.

preg_replace('/[^A-Za-z_]+(.*)?([^A-Za-z_]+)?/', '_', $string);

The first part of the regex, [^A-Za-z_]+, matches one or more characters that are not letters or underscores. The end part of the regex is the same, except that it is optional. That is meant to allow the middle part, (.*)?, which is also optional, to catch any characters (even letters and underscores) between two blacklisted characters.
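
It is worth testing this pattern before relying on it: because `.*` is greedy and the final group is optional, the middle part can consume everything up to the end of the line, including the letters that should be kept. The sketch below shows that effect, and a simplified pattern (my assumption about the intent, not part of the answer) that keeps only the first character class.

```php
<?php
// Pattern from the answer.
$pattern = '/[^A-Za-z_]+(.*)?([^A-Za-z_]+)?/';

// The space matches the first class, then (.*) swallows "Created".
echo preg_replace($pattern, '_', "Date Created"), "\n"; // prints Date_

// Hypothetical simplification: replace each run of unwanted characters.
$simplified = '/[^A-Za-z_]+/';

echo preg_replace($simplified, '_', "Date Created"), "\n";          // prints Date_Created
echo preg_replace($simplified, '_', "Date    1 2 3 Created"), "\n"; // prints Date_Created
```

The simplified pattern does not remove tag names, so HTML input would still need something like `strip_tags` first.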
