Split a string in javascript

Question

I need to split a string as follows:

const strin = 'test <br><span>test</span>  <div>aa</div>8'.split(/<\ *>/i)
console.log(strin)

The expected output is: ['test', '<br>', '<span>test</span>', '<div>aa</div>', '8']

That seems like a strange requirement. Why not parse the string as HTML instead? For example, `Array.from(new DOMParser().parseFromString("test <br><span>test</span>  <div>aa</div>8", "text/html").body.childNodes)` gets you an array of all six nodes (you’re missing the spaces between `</span>` and `<div>`). — Sebastian Simon, Nov 01 '21 at 19:22
A classic [XY problem](https://meta.stackexchange.com/questions/66377/what-is-the-xy-problem). `String.prototype.split` is not a good solution to this problem because of how complicated it would be, so you need a different approach, such as @SebastianSimon's DOMParser idea — Samathingamajig, Nov 01 '21 at 19:26
@Samathingamajig It’s not only “complicated”, it’s _impossible_. HTML isn’t a regular language; regular expressions, as the name suggests, only work for regular languages. You can use loops and other constructs outside of regular expressions, but this is akin to writing your own HTML parser, which is superfluous since `DOMParser` already exists. — Sebastian Simon, Nov 01 '21 at 19:28
@SebastianSimon, I want to group each HTML tag and then test with a regex whether it is valid — Asking, Nov 01 '21 at 19:30
Ah, I see, it’s related to your earlier question: [Test if string is a valid HTML code using javascript](/q/69799487/4642212). I’d still like to see a solid definition of “valid”. — Sebastian Simon, Nov 01 '21 at 19:31
@SebastianSimon, valid means it is a real HTML tag that developers use (e.g. `test` !== valid HTML,
test
=== valid,
test
!== valid). Is it clear now? — Asking, Nov 01 '21 at 19:40
@Asking So you’re not actually asking about _HTML validity_ (in the sense of HTML specification conformity) at all; `test` clearly is a valid HTML fragment; your examples should have ` `, etc. to be considered _valid_. Do you want to check if a string is exclusively made up of elements, i.e. `"A
B
"`, but not `"A
B
C"`? What about whitespace between tags? What about nested tags? What result do you expect? Simply a boolean indicating if the string meets your requirements? Or do you want to filter out text nodes? [Edit] to clarify! — Sebastian Simon, Nov 01 '21 at 19:46
Validation in my situation means: 1. Each string should be wrapped in an HTML tag. 2. HTML tags that don’t have a closing tag are allowed. 3. Whitespace between tags is allowed. 4. Nested tags are allowed. If the string contains an element that is not allowed, the function should return false; otherwise it returns true. — Asking, Nov 01 '21 at 19:58
So `const valid = !Array.from(new DOMParser().parseFromString("test <br><span>test</span>  <div>aa</div>8 asd", "text/html").body.childNodes).filter((node) => (node.nodeType !== Document.ELEMENT_NODE || node instanceof HTMLUnknownElement) && (node.nodeType !== Document.TEXT_NODE || node.textContent.trim())).length;` is probably a good start. — Sebastian Simon, Nov 01 '21 at 20:21
@SebastianSimon, very nice, but I tried to test `
test
aass
` and it gives me true. Could you help, please? It would help me a lot. — Asking, Nov 01 '21 at 20:29
@SebastianSimon, the issue is probably related to nested tags. — Asking, Nov 01 '21 at 20:33
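
A possible way to cover the nested case raised in the comments above is to check every element recursively instead of only the top-level nodes. The following is a minimal sketch under the rules stated in the thread, not a definitive implementation. The names `isValidFragment`, `allElementsKnown`, and the `isKnown` predicate are hypothetical, introduced here only for illustration. The functions only rely on `nodeType`, `childNodes`, and `textContent`, so they accept any parsed DOM tree.

```javascript
// Numeric values of Node.ELEMENT_NODE and Node.TEXT_NODE in the DOM standard.
const ELEMENT_NODE = 1;
const TEXT_NODE = 3;

// Hypothetical helper: true if `element` and every element nested inside it
// are accepted by the caller-supplied `isKnown` predicate.
function allElementsKnown(element, isKnown) {
  if (!isKnown(element)) return false;
  return Array.from(element.childNodes)
    .filter((child) => child.nodeType === ELEMENT_NODE)
    .every((child) => allElementsKnown(child, isKnown));
}

// Hypothetical helper: applies the rules from the thread to the children of `root`.
// - top-level elements must be known, including everything nested in them
// - top-level text may only be whitespace (all other text must be wrapped in a tag)
// - any other node type (for example a comment) is rejected
function isValidFragment(root, isKnown) {
  return Array.from(root.childNodes).every((node) => {
    if (node.nodeType === ELEMENT_NODE) return allElementsKnown(node, isKnown);
    if (node.nodeType === TEXT_NODE) return node.textContent.trim() === "";
    return false;
  });
}
```

In a browser, this could be called as `isValidFragment(new DOMParser().parseFromString(str, "text/html").body, (el) => !(el instanceof HTMLUnknownElement))`, assuming that "not an `HTMLUnknownElement`" is an acceptable definition of an allowed tag.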

score 0 · Answer 1 · answered Nov 01 '21 at 20:03

As @sebastian-simon mentioned, splitting HTML with regular expressions alone is impossible. The best solution is to use a real HTML parser. One already ships with your browser, and if you are using Node.js, you can use jsdom.

const str = 'test <br><span>test</span> <fake></fake> <div><p>aa</p></div>8';
const container = document.createElement("div");
container.innerHTML = str; // let an HTML element parse the string

// To handle nested tags, you have to traverse childNodes and their own childNodes yourself.

// childNodes includes text nodes; children does not.
// [...container.childNodes] converts the NodeList to a plain array
// so we can call .map on it.
const nodeList = [...container.childNodes];
const tags = nodeList
  // a text node has no outerHTML (it is undefined),
  // so fall back to textContent
  .map(node => node.outerHTML ?? node.textContent)
  .map(text => text.trim()) // strip surrounding whitespace
  .filter(text => text); // drop empty items
console.log(tags);

I am not sure I understand what you mean by nested tags. Could you show an example, please? You probably mean this:
4test
– Asking Nov 01 '21 at 20:21
Yes, "nested tags" means something like `
4test
`, an HTML tag inside another HTML tag. – Nov 01 '21 at 20:29
How should I proceed in this situation? – Asking Nov 01 '21 at 20:30
Recursion will do that: start with `container` and check each of its child nodes. Then, for each child node, check its own child nodes, and do the same for those "grandchild" nodes, and so on. – Nov 01 '21 at 20:40
Could you show please? – Asking Nov 01 '21 at 20:42
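
The recursion described in the comment above could look roughly like this. It is a sketch only: `walk` and `collectTagNames` are hypothetical helper names, not part of the answer's code or of any DOM API. The functions only use the standard `childNodes`, `nodeType`, and `nodeName` properties.

```javascript
// Numeric value of Node.ELEMENT_NODE in the DOM standard.
const ELEMENT_NODE = 1;

// Hypothetical helper: visits `node`, then each of its child nodes,
// then their child nodes, and so on (depth-first).
function walk(node, visit, depth = 0) {
  visit(node, depth);
  for (const child of node.childNodes) {
    walk(child, visit, depth + 1);
  }
}

// Hypothetical helper: lists the lower-cased tag name of every element
// below `root`, at any nesting level. The root itself (depth 0) is skipped.
function collectTagNames(root) {
  const names = [];
  walk(root, (node, depth) => {
    if (depth > 0 && node.nodeType === ELEMENT_NODE) {
      names.push(node.nodeName.toLowerCase());
    }
  });
  return names;
}
```

With the `container` from the answer, `collectTagNames(container)` would return `["br", "span", "fake", "div", "p"]`: the nested `<p>` is reached because `walk` descends into the `<div>`.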
