
I'm working on a script and need to split strings that contain both HTML tags and text. I'm trying to isolate the text and eliminate the tags.

For example, I want this:

string = '<p><span style="color:#ff3366;">A</span></p><p><span style="color:#ff3366;text-decoration:underline;">B</span></p><p><span style="color:#ff3366;text-decoration:underline;"><em>C</em></span></p>';

to be split like this:

separation = string.split(/some RegExp/);

and become:

separation[0] = '<span style="color:#ff3366;">A</span>';
separation[1] = '<span style="color:#ff3366;text-decoration:underline;">B</span>';
separation[2] = '<span style="color:#ff3366;text-decoration:underline;"><em>C</em></span>';

After that I would like to split each separation string like this:

stringNew = '<span style="color:#ff3366;">A</span>';

extendedSeperation = stringNew.split(/some RegExp/);

extendedSeperation[0] = "A";
extendedSeperation[1] = 'style="color:#ff3366;"';
Anh NC
PMe
  • Why not just use the parser that you have in the browser ? Everything would be trivial **and** correct. – Denys Séguret Jun 03 '15 at 07:58
  • Well, even I call it HTML parsing. Do not use any regex, check [Parse a HTML String with JS](http://stackoverflow.com/questions/10585029/parse-a-html-string-with-js). – Wiktor Stribiżew Jun 03 '15 at 07:59
  • Don't use regex parsing for html, it is messy – Arun P Johny Jun 03 '15 at 08:00
  • http://stackoverflow.com/a/1732454/2331182 there's already an answer for this – Burning Crystals Jun 03 '15 at 08:05
  • @BurningCrystals: Don't close dup as that question. That question doesn't contain any solution for the problem. – nhahtdh Jun 03 '15 at 08:50
  • I would post an answer for you but it is now closed as a duplicate. While I agree that the linked answers do contain good information as to why not use a RegExp, they do not deal with the requirements of this question. I am voting to reopen and here is a jsFiddle that I would post as an answer. http://jsfiddle.net/Xotic750/ne3vaoop/ – Xotic750 Jun 03 '15 at 09:57

2 Answers


Don't use a regex, for the reasons explained in the comments.

Instead, do this:

Create an invisible node:

node = $("<div>").css("display", "none");

Attach it to the body:

$("body").append(node);

Now inject your HTML into the node:

node.html(myHTMLString);

Now you can traverse the DOM tree and extract/render it as you like, much like this:

ptags = node.find("p") // will return all <p> tags

To get the content of a tag use:

ptags.eq(0).html() // ptags[0] is a raw DOM element with no .html() method; use .eq(0), or ptags[0].innerHTML

Finally, to clear the node do:

node.html("");

This should be enough to get you going.

This way you leverage the internal parser of the browser, as suggested in the comments.
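
If jQuery is not available, roughly the same idea can be sketched with the browser's built-in `DOMParser`, which parses the string into a detached document so nothing has to be appended to the page or cleared afterwards. This is only a sketch, not part of the answer above: `paragraphContents` is a hypothetical helper name, and the sample string is a shortened version of the one in the question.

```javascript
// Hypothetical helper: collect the inner HTML of every <p> under `root`.
// `root` can be anything that offers querySelectorAll (a Document or an Element).
function paragraphContents(root) {
    var paragraphs = root.querySelectorAll('p');

    return Array.prototype.map.call(paragraphs, function (p) {
        return p.innerHTML;
    });
}

// Browser-only usage: the guard keeps the snippet harmless where no DOM exists.
if (typeof DOMParser !== 'undefined') {
    var html = '<p><span style="color:#ff3366;">A</span></p>' +
               '<p><span style="color:#ff3366;text-decoration:underline;">B</span></p>';
    var doc = new DOMParser().parseFromString(html, 'text/html');

    // One array entry per <p>, each holding that paragraph's inner markup
    console.log(paragraphContents(doc));
}
```

Passing the parsed document into the helper, rather than reading a global, keeps the extraction step independent of how the HTML was parsed, so the same function also works on the hidden jQuery node via `paragraphContents(node[0])`.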

pid

Your exact expectations are a little unclear, but based only on the information given, here is an example that may give you ideas.

Does not use RegExp

Does not use jQuery or any other library

Does not append and remove elements from the DOM

Is well supported across browsers

function walkTheDOM(node, func) {
    func(node);
    node = node.firstChild;
    while (node) {
        walkTheDOM(node, func);
        node = node.nextSibling;
    }
}

function textContent(node) {
    if (typeof node.textContent !== "undefined" && node.textContent !== null) {
        return node.textContent;
    }

    var text = "";

    walkTheDOM(node, function (current) {
        if (current.nodeType === 3) {
            text += current.nodeValue;
        }
    });

    return text;
}

function dominate(text) {
    var container = document.createElement('div');

    container.innerHTML = text;

    return container;
}

function toSeparation(htmlText) {
    var spans = dominate(htmlText).getElementsByTagName('span'),
        length = spans.length,
        result = [],
        index;

    for (index = 0; index < length; index += 1) {
        result.push(spans[index].outerHTML);
    }

    return result;
}

function toExtendedSeperation(node) {
    var child = dominate(node).firstChild,
        attributes = child.attributes,
        length = attributes.length,
        text = textContent(child),
        result = [],
        style,
        index,
        attr;

    if (text) {
        result.push(text);
    }

    for (index = 0; index < length; index += 1) {
        attr = attributes[index];
        if (attr.name === 'style') {
            result.push(attr.name + '=' + attr.value);

            break;
        }
    }

    return result;
}

var strHTML = '<p><span style="color:#ff3366;">A</span></p><p><span style="color:#ff3366;text-decoration:underline;">B</span></p><p><span style="color:#ff3366;text-decoration:underline;"><em>C</em></span></p>',
    separation = toSeparation(strHTML),
    extendedSeperation = toExtendedSeperation(separation[0]),
    pre = document.getElementById('out');

pre.appendChild(document.createTextNode(JSON.stringify(separation, null, 2)));
pre.appendChild(document.createTextNode('\n\n'));
pre.appendChild(document.createTextNode(JSON.stringify(extendedSeperation, null, 2)));
<pre id="out"></pre>

Of course you will need to make modifications to suit your exact needs.
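
To see what `walkTheDOM` does in the `textContent` fallback, the sketch below runs it over plain objects instead of real DOM nodes, so it can be tried outside a browser. `fakeNode` is a hypothetical helper invented for this illustration; it only fills in the properties the walker reads (`nodeType`, `nodeName`, `nodeValue`, `firstChild`, `nextSibling`).

```javascript
// Same traversal as in the answer: visit the node, then each child in order.
function walkTheDOM(node, func) {
    func(node);
    node = node.firstChild;
    while (node) {
        walkTheDOM(node, func);
        node = node.nextSibling;
    }
}

// Hypothetical stand-in for a DOM node; links the children as siblings.
function fakeNode(nodeType, nodeName, nodeValue, children) {
    var node = {
        nodeType: nodeType,
        nodeName: nodeName,
        nodeValue: nodeValue,
        firstChild: null,
        nextSibling: null
    };
    var index;

    children = children || [];
    for (index = 0; index < children.length; index += 1) {
        if (index === 0) {
            node.firstChild = children[index];
        } else {
            children[index - 1].nextSibling = children[index];
        }
    }

    return node;
}

// Mirrors the markup <span><em>C</em>D</span>
var tree = fakeNode(1, 'SPAN', null, [
    fakeNode(1, 'EM', null, [fakeNode(3, '#text', 'C')]),
    fakeNode(3, '#text', 'D')
]);

var visited = [];
var text = '';

walkTheDOM(tree, function (current) {
    visited.push(current.nodeName);
    if (current.nodeType === 3) { // 3 is a text node
        text += current.nodeValue;
    }
});

console.log(visited.join(' > ')); // SPAN > EM > #text > #text
console.log(text); // CD
```

The walk is depth-first and visits a parent before its children, which is why the text fragments come out in document order even when they sit at different nesting levels.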

Xotic750