I'm after several really safe regex patterns for JavaScript's .replace method. The input is a serialized DOM string, and I want to remove all YUI3 classNames and YUI3-generated id attributes.

var resourceDOMStr = Y.DataType.XML.format( Y.Node.getDOMNode(this.getIframeDOMContainer()).innerHTML );
alert('unsanitized markup:\n\n'+resourceDOMStr );
// Remove YUI-added id's and classes
    // regex to remove ' id="*"'
    // regex to remove entire class attr: ' class="'yui3-*'"'
    // regex to remove className + trailing space: class="'yui3-* 'safeClass"
    // regex to remove className + leading space: class="safeClass' yui3-*'"
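// placeholder pattern; .replace returns a new string, so the result must be reassigned
resourceDOMStr = resourceDOMStr.replace('', '');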
alert('sanitized markup:\n\n'+resourceDOMStr );

So yeah, I'd like to be clean and remove the entire id attribute, whose value will always begin with 'yui_3', e.g. id="yui_3_3_0_1_1296949124608175". Also, I want to remove an entire class attribute if the only class it has is a YUI3-generated className; otherwise I just want to remove the YUI3 className and any leading/trailing spaces. The generated classNames will always begin with 'yui3-', for example:

  • class="yui3-dd-shim"
  • class="safeClass yui3-dd-shim"
  • class="yui3-dd-shim safeClass"

...where I don't want 'safeClass' to be altered, and I don't want a build-up of leading/trailing spaces, as the resulting replaced String will be loaded, cleaned and saved many times over.
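
One way to read the rules above, as a minimal sketch rather than a vetted solution: treat each class attribute as a list of tokens, drop the tokens starting with 'yui3-', and drop the whole attribute if nothing is left. The function name stripYui3 is hypothetical, and the sketch assumes double-quoted attribute values (as in the sample markup) and that text such as ' class="..."' never appears outside a tag.

```javascript
// Hypothetical helper; the name is not from any library.
// Assumes double-quoted attributes, as produced by the serialized markup.
function stripYui3(markup) {
    return markup
        // Drop id attributes whose value starts with "yui_3", plus the whitespace before them.
        .replace(/\s+id="yui_3[^"]*"/g, '')
        // Rewrite every class attribute token by token.
        .replace(/(\s+)class="([^"]*)"/g, function (match, lead, value) {
            var kept = value.split(/\s+/).filter(function (name) {
                // Keep non-empty names that are not YUI3-generated.
                return name !== '' && name.indexOf('yui3-') !== 0;
            });
            // Remove the attribute entirely when no class survives.
            return kept.length ? lead + 'class="' + kept.join(' ') + '"' : '';
        });
}

console.log(stripYui3('<div class="snippet yui3-dd-drop" id="yui_3_3_0_1_1296942015298253">'));
// → <div class="snippet">
console.log(stripYui3('<td class="yui2-dd-drop yui3-dd-drop">7</td>'));
// → <td class="yui2-dd-drop">7</td>
```

Because surviving names are rejoined with single spaces, running the function again on its own output should leave it unchanged, which matters if the string is cleaned and saved repeatedly.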

Many thanks for any headache-solvers. d


EDIT:

    <div id="wrap"><h1 id="yui_3_3_0_1_1296942015298202" class="yui3-dd-drop">Resource 1 Title</h1>
                            <p id="yui_3_3_0_1_1296942015298219" class="yui3-dd-drop">Lorem ipsum dolor sit amet, <a href="javacript:;" id="yui_3_3_0_1_1296942015298236" class="yui3-dd-drop">consectetur adipiscing</a> elit. Proin et sem leo, sed luctus nisi. Suspendisse pharetra iaculis laoreet. Pellentesque vulputate malesuada auctor. Integer laoreet ultricies nunc facilisis adipiscing.</p>

<div class="widget revealer">
        <p>Revealer widget.</p>
        <script type="text/javascript">
            document.RevealerConfig = true;
        </script>
    </div>

<div class="widget quiz safeClass" id="safeId">
        <p>Quiz widget.</p>
        <script type="text/javascript">
            document.QuizConfig = true;
        </script>
    </div>
                            <div class="snippet yui3-dd-drop" id="yui_3_3_0_1_1296942015298253">
                                Vestibulum fermentum, justo id porta suscipit, velit lorem hendrerit nisi, id tincidunt lectus ante quis lacus. Proin et erat sit amet turpis euismod dictum vitae a metus.
                            <div class="widget table">
        <p>Table widget.</p>
        <table width="80%" border="1">
            <tbody><tr>
                <td>1</td>
                <td>2</td>
                <td>3</td>
            </tr>
            <tr>
                <td>4</td>
                <td>5</td>
                <td>6</td>
            </tr>
            <tr>
                <td>7</td>
                <td>8</td>
                <td>9</td>
            </tr>
        </tbody></table>
    </div></div>
                            <p id="yui_3_3_0_1_1296942015298270" class="yui3-dd-drop">Proin et sem leo, sed luctus nisi. Suspendisse pharetra iaculis laoreet. Pellentesque vulputate; laoreet ultricies nunc facilisis adipiscing ultricies nunc.</p>

<div class="widget table">
        <p>Table widget.</p>
        <table width="80%" border="1">
            <tbody><tr>
                <td>1</td>
                <td>
<ul>
<li>1</li>
<li>2<ul><li id="yui_2_0_0_1">nested</li></ul></li>
</ul>
</td>
                <td>3</td>
            </tr>
            <tr>
                <td>4</td>
                <td>5</td>
                <td>6</td>
            </tr>
            <tr>
                <td class="yui2-dd-drop yui3-dd-drop">7</td>
                <td class="yui2-dd-drop yui3-dd-drop">8</td>
                <td class="yui2-dd-drop yui3-dd-drop">9</td>
            </tr>
        </tbody></table>
    </div>
</div>

Hopefully the above is all good; don't pick it apart too readily. As stated in a comment below, it's sample HTML.

Deduplicator
danjah
  • http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – Michael Robinson Feb 06 '11 at 00:13
  • Bah. He’s already got DOM data. Stop knee-jerking. It’s not sensible, it’s not helpful, and it is misleading. Using regexes on HTML **CAN** be perfectly reasonable, and to pretend otherwise is a disservice to the querent — and yourselves. – tchrist Feb 06 '11 at 00:16
  • But... but... the regex... html.. – Michael Robinson Feb 06 '11 at 00:21
  • @Danjah: What do you mean by “safe”? What would be an example of something that were “unsafe”? – tchrist Feb 06 '11 at 00:22
  • @Michael: Yeah so? Tame HTML, which he seems to have, is a breeze with regexes. And [some of us](http://stackoverflow.com/questions/4840988/the-recognizing-power-of-modern-regexes/4843579#4843579) really are anything but intimidated by [harder problems](http://stackoverflow.com/questions/4284176/doubt-in-parsing-data-in-perl-where-am-i-going-wrong/4286326#4286326). Note that the last example is my attempt to convince someone **not** to use regexes on generic HTML, showing them how hard it is to do right. – tchrist Feb 06 '11 at 00:28
  • I guess I provided the link as a knee-jerk reaction, as you said. I may have been pushed too far towards the 'no regex for HTML' side of things by answers like the one I linked to - funny but without much actual content. Thanks for shaking me out of it. – Michael Robinson Feb 06 '11 at 00:33
  • @tchrist: 'unsafe' would simply be the removal of classes other than the yui3 classes. the id attribute can safely be removed if it exists. – danjah Feb 06 '11 at 00:36
  • Could you provide a sample of the HTML strings you want cleaned? Hard to test our answers without it... – Michael Robinson Feb 06 '11 at 00:36
  • Sort of. The HTML isn't actually formed yet, for the most part, but I am working with some sample HTML, though it's very basic. I've inserted a few anticipated 'curve balls', but as some of you point out, this can be a risky business. I will also make my way through the links posted, thanks for the heads up. I will also consider using YUI DOM traversal on a copy of the DOM structure I've serialized to perform class removal and hopefully id attribute removal, though it may get quite expensive to do this. – danjah Feb 06 '11 at 01:03

1 Answer

You could try this monstrosity:

var dirty = 'class="yui3-dd-shim" class="safeClass yui3-dd-shim" class="yui3-dd-shim safeClass"';

var clean = dirty.replace(/class="yui[0-9]-[^\s]+"|\s?yui[0-9]-[^\s"]+\s?|id="yui_[0-9][^"]+"/gi, '');

Tested it on your sample data, seemed to do the job.
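
A few inputs beyond that sample string are worth checking before relying on the pattern. The sketch below reuses the regex above unchanged; the wrapper name answerClean and the test strings are made up for illustration. It suggests three things to watch for: `yui[0-9]` also matches yui2 classes, the optional space on both sides can glue two neighbouring class names together, and the second alternative is not limited to attribute values.

```javascript
// The pattern from the answer above, unchanged.
var pattern = /class="yui[0-9]-[^\s]+"|\s?yui[0-9]-[^\s"]+\s?|id="yui_[0-9][^"]+"/gi;

// Hypothetical wrapper, only here to make the checks readable.
function answerClean(dirty) {
    return dirty.replace(pattern, '');
}

// A generated class between two others: both surrounding spaces are consumed.
console.log(answerClean('class="a yui3-x b"'));
// → class="ab"

// yui2 classes are removed too, and an empty class attribute is left behind.
console.log(answerClean('class="yui2-dd-drop yui3-dd-drop"'));
// → class=""

// The class-name alternative also fires inside text content.
console.log(answerClean('<p>uses yui3-dd-drop here</p>'));
// → <p>useshere</p>
```

If those cases can occur in the real markup, tightening the digit class to a literal 3 and handling class values name by name would likely be safer than a single alternation.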

Michael Robinson
  • That is no monstrosity! It’s just a little naïve. It might work on tame HTML though. – tchrist Feb 06 '11 at 00:21
  • Then we will hope for tame HTML! I'm no expert with regex, but I do enjoy them. Do you have any quick pointers on making something like that less naïve? – Michael Robinson Feb 06 '11 at 00:27
  • You can look at the various techniques I apply [here](http://stackoverflow.com/questions/4284176/doubt-in-parsing-data-in-perl-where-am-i-going-wrong/4286326#4286326) on **non-tame** HTML. Just bear in mind that I was trying to show folks how hard it is to correctly deal with non-tame, in-the-wild HTML, and why they almost certainly do not want to do that. (Most people misread my posting, thinking I mean the reverse.) However, for tame stuff, it should not be a problem. Somewhere I have a list of the things to worry about, but I have a lot of postings so it’s hard to find. – tchrist Feb 06 '11 at 00:32