How does one extract any text-content from HTML-code which does contain whitespace but neither tab nor line-break?

Question

How can I find and select, in any HTML, only the text that contains spaces but no tabs or line breaks, without selecting the tags themselves?

I managed the opposite case (matching the tags themselves), but not the one described above.

<html>
<body>
<h1> text1</h1>
<p>text2</p>
text14
<p>   text3   </p>
text2
</body>
</html>

This is what I got:

<[^>]+>(.+?)<\/[^>]+>

You need to escape the slash: `\/` https://regex101.com/r/uotHkT/1 — mplungjan, Jun 27 '23 at 12:58
@mplungjan ... what happens with e.g. ... [nested html tags](https://regex101.com/r/uotHkT/2)? — Peter Seliger, Jun 27 '23 at 13:01
@dedtis ... One needs a real [dom parsing](https://developer.mozilla.org/en-US/docs/Web/API/DOMParser) approach; regex is not suited for such parsing tasks. — Peter Seliger, Jun 27 '23 at 13:02
@PeterSeliger [Of course it doesn't work](https://stackoverflow.com/a/1732454/295783) — mplungjan, Jun 27 '23 at 13:05
this is just a task for checking knowledge of regexp but not for production — dedtis, Jun 27 '23 at 13:26
Does this answer your question? [RegEx match open tags except XHTML self-contained tags](https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) — symcbean, Jun 27 '23 at 13:30
Your example fails to explain what should happen for tags like `` and situations like `aaabbbccc`. Assuming you want any text outside of tags, you might start with something like [`(?<=>\s*)(?!\s)[^<>]+(?<!\s)`](https://regex101.com/r/IanGer/2). — markalex, Jun 27 '23 at 13:56
You can add `\n\t` to the negated class and use a [capture group](https://www.regular-expressions.info/brackets.html) for extraction ([regex101 demo](https://regex101.com/r/x5He8E/1)). — bobble bubble, Jun 27 '23 at 14:51
@dedtis ... Regarding all the so far provided answers, are there any questions left? — Peter Seliger, Jun 29 '23 at 15:31
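
A minimal regex-only sketch of what the comments above hint at (tabs and line breaks added to the negated class, a capture group for the text). The function name `extractInlineText` is hypothetical, and the pattern assumes simple markup like the question's sample: it treats anything between `>` and `<` as text, so it will misfire on HTML comments, scripts, or a `>` inside an attribute value, which is why the comments recommend a DOM parser for real input.

```javascript
// Hypothetical helper: collects text that sits between a ">" and the next "<"
// and contains no tab or line break, then trims it.
function extractInlineText(html) {
  // [^<>\t\n\r]*  any run of non-tag characters without tab / line break
  // [^<>\s]       at least one non-whitespace character, so blank runs are skipped
  const pattern = />([^<>\t\n\r]*[^<>\s][^<>\t\n\r]*)</g;
  return Array.from(html.matchAll(pattern), (match) => match[1].trim());
}

// The sample markup from the question, one source line per array entry.
const sample = [
  "<html>",
  "<body>",
  "<h1> text1</h1>",
  "<p>text2</p>",
  "text14",
  "<p>   text3   </p>",
  "text2",
  "</body>",
  "</html>",
].join("\n");

console.log(extractInlineText(sample)); // [ 'text1', 'text2', 'text3' ]
```

The loose `text14` and `text2` lines are skipped because the text between their neighbouring tags includes a line break.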

mplungjan · Answer 1 · 2023-07-10T06:29:00.107

1

Assuming the expected result is

["text1", "text2", "text3"]

that is, text nodes containing tabs or newlines are ignored, you can parse the markup with DOMParser's parseFromString and walk the text nodes with createNodeIterator:

const htmlStr = `<html>
    <body>
      <h1> text1</h1>
      <p>text2</p>
      text14 is ignored due to newlines
      <p> text3 </p>
      text2
    </body>
    </html>`;
const parser = new DOMParser();
const dom = parser.parseFromString(htmlStr, "text/html");

// Walk only the text nodes of the parsed document.
const nodeIterator = dom.createNodeIterator(dom.documentElement, NodeFilter.SHOW_TEXT);

const textArr = [];
let currentNode;
while ((currentNode = nodeIterator.nextNode())) {
  const text = currentNode.textContent;
  const textHasTabsOrNewlines = /[\t\n]/.test(text);
  console.log("text:>", text, "<", textHasTabsOrNewlines); // debug output
  const textOnly = text.trim();
  // Keep non-empty text whose node contains no tab or newline.
  if (textOnly !== "" && !textHasTabsOrNewlines) textArr.push(textOnly);
}
console.log(textArr);

edited Jul 10 '23 at 06:29

answered Jun 27 '23 at 13:16

mplungjan


yeah, I need a regexp – dedtis Jun 27 '23 at 13:20
Why? If we knew why, we could help better. – mplungjan Jun 27 '23 at 13:23
this is just a task for checking knowledge of regexp but not for production – dedtis Jun 27 '23 at 13:26
So we now give you the knowledge that it is not a task for regexp :) – mplungjan Jun 27 '23 at 13:31
it's a pity, but I would still like to look at the solution, since there are a lot of solutions with highlighting the tags themselves. – dedtis Jun 27 '23 at 13:36

Peter Seliger · Answer 2 · 2023-06-27T15:22:09.977

1

The requirements as in the OP's own words ...

"how to find and select in any html only text with spaces but without tabs and line breaks and not select the tags themselves"

The approach needs to be manifold, because neither a purely regex-based approach nor a purely DOMParser- and NodeIterator-based approach is capable of returning the OP's expected result on its own.

But a NodeIterator instance with an additionally applied filter, where the filter uses two regex-based tests, does the job ...

const code =
`<html>
  <body>
    <h1>foo</h1>  <!-- no pick ... not a single white space at all -->
    <p>  bar </p> <!-- pick... ... simple spaces only -->
    baz           <!-- no pick ... leading tab and new line -->
    <p>bizz</p>   <!-- no pick ... not a single white space at all -->
    buzz          <!-- no pick ... leading simple spaces and new line -->
    <p>booz  </p> <!-- pick... ... simple spaces only -->
  </body>
</html>`;

const dom = (new DOMParser)
  .parseFromString(code, 'text/html');

const textNodeIterator =
  document.createNodeIterator(
    dom.documentElement,
    NodeFilter.SHOW_TEXT,
    node => (
      (node.textContent.trim() !== '') && // - content other than just white space(s)
      (/\s+/).test(node.textContent) &&   // - content with any kind of white space
      !(/[\t\n]+/).test(node.textContent) // - content without tabs and new lines
    )
    ? NodeFilter.FILTER_ACCEPT
    : NodeFilter.FILTER_REJECT
  );

const textContentList = [];
let textNode;

while (textNode = textNodeIterator.nextNode()) {
  textContentList.push(textNode.textContent)
}
console.log({ textContentList });
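
To see what the filter accepts without a DOM, its three checks can be pulled out into a plain predicate and tried on strings. This is only a sketch: `isSpaceOnlyText` is a hypothetical name, and the sample strings are assumed to mirror the kind of text nodes the parser would produce for the markup above.

```javascript
// Hypothetical stand-alone version of the filter's three checks.
function isSpaceOnlyText(text) {
  return (
    text.trim() !== "" &&   // something other than white space
    /\s/.test(text) &&      // at least one white space character
    !/[\t\n]/.test(text)    // but no tab and no new line
  );
}

// Assumed stand-ins for text nodes: plain words, space-padded words,
// a node with a line break, and a white-space-only node.
const candidates = ["foo", "  bar ", "\n    baz   ", "bizz", "booz  ", "   "];
console.log(candidates.filter(isSpaceOnlyText)); // [ '  bar ', 'booz  ' ]
```

Unlike the first answer, this predicate also rejects text without any white space at all, such as `foo`.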


edited Jun 27 '23 at 15:22

answered Jun 27 '23 at 14:19

Peter Seliger


I'm really curious about the technical reason/fault for giving the above answer an uncommented -1 vote. [Especially since another regex related answer of mine experienced the same behavioral pattern](https://stackoverflow.com/a/76527984/2627243). The above solution is the only one that actually fully meets the OP's requirements. It exactly explains the approach and why a regex only solution is not suited for the OP's task. Without comments nobody gets an understanding of what's wrong with the above approach, and the answer can not be improved either. – Peter Seliger Jul 04 '23 at 08:10
I got one too. I assume voter does not like we try to help someone who wants to regex HTML – mplungjan Jul 04 '23 at 08:32
the expected output is not clear I would say – mplungjan Jul 04 '23 at 08:38
@mplungjan ... Most probably. Which means the OP could have pointed that and as a result could have refined the question. – Peter Seliger Jul 04 '23 at 08:42
I updated my code, but cannot save (rate limited?), to what I now assume is what OP wanted – mplungjan Jul 04 '23 at 08:54
Uncommented -1 vote again. Why? Where is the technical fault which justifies -1 votes? The above solution does exactly match the OP's requirements. The approach got explained. The implementation got commented. The OP has been informed that a regex-only solution is not reliable, but a parser-based one is. And since the latter alone can not solve the problem entirely either, one has to combine parser and regex. The above answer mentions all that and provides exactly what has been asked for. **Without openly communicating a reason for one's dissatisfaction the above answer can not be improved.** – Peter Seliger Jul 09 '23 at 20:51
I also have two downvotes... Boggles the mind. – mplungjan Jul 10 '23 at 06:28
@mplungjan some people downvote all the answers to discourage giving answers to bad questions. This question is bad and should be improved or closed as a dup. (btw I'm not in this group) – mx0 Jul 12 '23 at 07:43
