10

I need help writing a regex function that converts an HTML string to a valid XML tag name. It takes a string and does the following:

  • If a letter or an underscore occurs in the string, it is kept.
  • If any other character occurs, it is removed from the output string.
  • If any other character (or a run of them) occurs between words or letters, it is replaced with a single underscore.
Examples:
Input: Date Created
Output: Date_Created

Input: Date<br/>Created
Output: Date_Created

Input: Date\nCreated
Output: Date_Created

Input: Date    1 2 3 Created
Output: Date_Created

Basically, the regex function should convert the HTML string to a valid XML tag name.

ROMANIA_engineer
Jake
  • 3
    Your question says "I want to write", but it reads like a requirement list and waiting for someone to drop the desired magic regex codes. Not clear what you consider XML tags anyway, the output examples contain none. – mario Jun 03 '12 at 04:20
  • @JackManey: That has over 4000 upvotes now..? Sheesh. – mpen Jun 03 '12 at 04:22
  • 1
    What's wrong if the situation comes only once in a blue moon and it's just to add a ``quick and dirty patch-up`` to your test code in a whirl! AND USE REGEX INSTEAD OF DOM... – Cylian Jun 03 '12 at 04:29

4 Answers

5

A bit of regex and a bit of standard functions:

function mystrip($s)
{
        // add spaces around angle brackets to separate tag-like parts
        // e.g. "<br />" becomes " <br /> "
        // then let strip_tags take care of removing html tags
        $s = strip_tags(str_replace(array('<', '>'), array(' <', '> '), $s));

        // any sequence of characters that are not alphabet or underscore
        // gets replaced by a single underscore
        return preg_replace('/[^a-z_]+/i', '_', $s);
}
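To check the function above against the four inputs from the question, here is a minimal sketch. It repeats `mystrip` so that it runs on its own; the only assumption is that `\n` in the question means a real newline character rather than a literal backslash followed by `n`.

```php
<?php
// Copy of the answer's function, repeated so this sketch is self-contained.
function mystrip($s)
{
    // Pad angle brackets with spaces so adjacent words stay separated
    // once strip_tags removes the tag itself.
    $s = strip_tags(str_replace(array('<', '>'), array(' <', '> '), $s));

    // Collapse every run of characters other than letters or underscore
    // into a single underscore.
    return preg_replace('/[^a-z_]+/i', '_', $s);
}

// Inputs taken from the question; "\n" is assumed to be a real newline.
$inputs = array(
    "Date Created",
    "Date<br/>Created",
    "Date\nCreated",
    "Date" . str_repeat(' ', 4) . "1 2 3 Created",
);

foreach ($inputs as $input) {
    echo mystrip($input), PHP_EOL; // each line prints Date_Created
}
```

Note that leading or trailing non-letter characters also turn into an underscore (for example, `' Date'` gives `_Date`), so trim the input first if that matters.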
Ja͢ck
2

Try this

$result = preg_replace('/([\d\s]|<[^<>]+>)/', '_', $subject);

Explanation

(                 # Match the regular expression below and capture its match into backreference number 1
                  # Match either the regular expression below (attempting the next alternative only if this one fails)
   [\d\s]         # Match a single character present in the list below
                  #    A single digit 0..9
                  #    A whitespace character (spaces, tabs, and line breaks)
 |                # Or match regular expression number 2 below (the entire group fails if this one fails to match)
   <              # Match the character "<" literally
   [^<>]          # Match a single character NOT present in the list "<>"
      +           #    Between one and unlimited times, as many times as possible, giving back as needed (greedy)
   >              # Match the character ">" literally
)
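As written, the pattern replaces each digit, whitespace character, or tag separately, so a run such as `    1 2 3 ` yields one underscore per character. If a single underscore per run is wanted, as in the question's last example, one possible variant is to repeat the group. The sketch below assumes that is the goal; `collapse_separators` is a hypothetical helper name, not part of the answer.

```php
<?php
// Hypothetical helper, not from the answer: same alternatives as the
// original pattern, but the group is repeated so a whole run of digits,
// whitespace, and tags is replaced by one underscore.
function collapse_separators($subject)
{
    return preg_replace('/(?:[\d\s]|<[^<>]+>)+/', '_', $subject);
}

$input = 'Date' . str_repeat(' ', 4) . '1 2 3 Created';

// Original pattern: one underscore per matched character.
echo preg_replace('/([\d\s]|<[^<>]+>)/', '_', $input), PHP_EOL;

// Repeated group: a single underscore for the whole run.
echo collapse_separators($input), PHP_EOL;
echo collapse_separators('Date<br/>Created'), PHP_EOL;
```

In both forms, other characters such as punctuation pass through unchanged, because the pattern only targets digits, whitespace, and tags.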
Cylian
2

Should be able to use:

$text = preg_replace( '/(?<=[a-zA-Z])[^a-zA-Z_]+(?=[a-zA-Z])/', '_', $text );

The lookarounds check that there is a letter immediately before and after the match, and the pattern replaces any run of non-letter, non-underscore characters between them with a single underscore.
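A small sketch for trying the pattern on a few inputs; `join_words` is a hypothetical wrapper name used only for illustration. It shows two consequences of the lookarounds: letters inside a tag are kept, and characters at the very start or end of the string are not touched.

```php
<?php
// Hypothetical wrapper around the answer's pattern, for experimenting.
function join_words($text)
{
    // Replace a run of non-letter, non-underscore characters only when a
    // letter sits directly before and directly after it.
    return preg_replace('/(?<=[a-zA-Z])[^a-zA-Z_]+(?=[a-zA-Z])/', '_', $text);
}

echo join_words('Date Created'), PHP_EOL;      // Date_Created
echo join_words('Date<br/>Created'), PHP_EOL;  // Date_br_Created
echo join_words(' Date Created '), PHP_EOL;    // outer spaces are left as-is
```

If the input can contain markup, it probably needs to have its tags stripped before this replacement runs.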

adomnom
1

I believe the following should work.

preg_replace('/[^A-Za-z_]+(.*)?([^A-Za-z_]+)?/', '_', $string);

The first part of the regex, [^A-Za-z_]+, matches one or more characters that are not letters or underscores. The end part of the regex is the same, except that it is optional. That allows the middle part, (.*)?, which is also optional, to catch any characters (even letters and underscores) between two blacklisted characters.
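Before relying on this pattern, it may be worth running it against the question's inputs. In the sketch below, the greedy `(.*)` runs to the end of the line as soon as the first disallowed character has matched, so everything after that character is replaced too.

```php
<?php
// The answer's pattern, unchanged.
$pattern = '/[^A-Za-z_]+(.*)?([^A-Za-z_]+)?/';

// The greedy (.*) runs to the end of the line once the first disallowed
// character has matched, so the text after it is replaced as well.
echo preg_replace($pattern, '_', 'Date Created'), PHP_EOL;     // Date_
echo preg_replace($pattern, '_', 'Date<br/>Created'), PHP_EOL; // Date_
```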

Litty