Find and extract an html tag from a large page

Question

Consider a very long HTML page held as a string. How can I extract a tag together with its content? Any long Wikipedia page is a good example.

Using a parser like cheerio is excluded for performance reasons, and so is any technique that parses the entire page (as the existing answers do; please read the question before marking it as a duplicate).

The start position is easily found with indexOf("<div class='selector'>");

The issue is with the end position.

How can I find the matching closing </div>, given the position of the start tag? There are many other div elements nested inside.

HtmlAgilityPack. Don't use regex for parsing HTML. You'll have a bad time, and many of us here at SO will be sad. Don't make us sad. — , Nov 15 '19 at 21:55
Some other engines can do this. Regex is going to struggle with tag balancing in JS, but PCRE, Perl, etc. can do it. Can those be used? If you are only parsing tags, JS can be used, but the script will need to maintain a stack within a callback. Let me know if you need that. — , Nov 15 '19 at 21:57
Even within JS, you can use the DOM to parse HTML, using the `DOMParser` class. — , Nov 15 '19 at 22:01
Are there other DIV tags nested inside the `<div>`, or do you mean **inside** the document? — Ali Sheikhpour, Nov 15 '19 at 22:03
@x15 I don't feel like taking the time to write a good answer, and the OP hasn't responded to any comments yet. You can if you wish. — , Nov 15 '19 at 22:03
Please provide a ***"html page as a string"*** example and explain what you've tried already. — Pedro Lobito, Nov 15 '19 at 22:05
DOMParser, or any solution that parses the entire page, is excluded for performance reasons. It took 250ms. — Slim, Nov 16 '19 at 03:07

score 0 · Answer 1 · answered Nov 16 '19 at 04:58

Raw JavaScript scraping (I put some inner SPAN tags within the finder element):

var htmlString = "<body><h1>Welcome</h1><div class='wrapper'><div>Some content here<div class='selector'>This is the element <span>you <title>want</title> to </span>extract</div></div></div></body>";

var finder = "<div class='selector'>";
var AhtmlString = htmlString.split(finder);

var back = AhtmlString[1].split("</");
var countInside = back[0].split("<"); // length = number of tags opened before the first closing tag, plus one
var backClose = back[countInside.length].split(">")[0]; // name of the closing tag at that offset; assumes no tag opens after the first closing tag inside the element

console.log(finder + back.slice(0, countInside.length).join("</") + "</" + backClose + ">");
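
To make the assumption behind the split-based steps concrete, here is a minimal sketch that wraps them in a function and runs them on two inputs. The name `extractBySplit` and both sample strings are hypothetical and not part of the answer. The count only lines up when every inner tag opens before the first closing tag; with sibling children the result is cut short.

```javascript
// Hypothetical wrapper around the split-based steps above.
function extractBySplit(html, finder) {
  const parts = html.split(finder);
  if (parts.length < 2) return undefined; // finder not present

  const back = parts[1].split("</");
  // Number of tags opened before the first closing tag, plus one.
  const openCount = back[0].split("<").length;
  if (openCount >= back.length) return undefined; // not enough closing tags
  const closeName = back[openCount].split(">")[0];
  return finder + back.slice(0, openCount).join("</") + "</" + closeName + ">";
}

const splitFinder = "<div class='selector'>";

// Nested children only: the heuristic returns the whole element.
const nestedHtml = "<div class='selector'>a <span>b <b>c</b></span> d</div><p>tail</p>";
console.log(extractBySplit(nestedHtml, splitFinder));

// Sibling children: a tag opens after the first closing tag,
// so the result stops at the second </span> and loses the </div>.
const siblingHtml = "<div class='selector'><span>a</span><span>b</span></div><p>tail</p>";
console.log(extractBySplit(siblingHtml, splitFinder));
```

For inputs like the second one, a depth counter that tracks both opening and closing tags is needed.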

I would have liked to measure it against my solution, but I could not get your function to pass my tests. Since I am happy with mine, I won't investigate further. — Slim, Nov 16 '19 at 05:17

score -1 · Answer 2 · answered Nov 15 '19 at 22:32

If I understand you correctly, you really just have a string of HTML, not an actual page with that HTML parsed.

You can easily solve the problem by loading up a temporary element with that HTML string (but never actually include it in the DOM) and then extract the part you need using the DOM API, rather than string methods.

Here's a scaled down example:

let htmlString = "<body><h1>Welcome</h1><div class='wrapper'><div>Some content here<div class='selector'>This is the element you want to extract</div></div></div></body>";

// Load the html string up into a temporary object that isn't part of the DOM
let temp = document.createElement("div");
temp.innerHTML = htmlString;

// Now use the DOM API to extract what you need:
let part = temp.querySelector("div.selector");

// Use outerHTML to get the tag itself along with its contents
console.log(part.outerHTML);

Thank you Scott, I am not using a browser but Node. — Slim, Nov 16 '19 at 03:19

Slim · Accepted Answer · 2019-11-18T07:42:46.340

This runs in about 3ms, versus 250ms with a parser. Parsing the whole document is really not needed.

const findTag = (body, tagStart, tagName) => {
  const startIndex = body.indexOf(tagStart)
  if (startIndex === -1) return

  const endIndex = findEndIndex(body, startIndex, tagName)
  if (endIndex === -1) return // no matching closing tag
  return body.substring(startIndex, endIndex)
}

const findEndIndex = (body, startIndex, tagName) => {
  const starting = `<${tagName}`
  const closing = `</${tagName}`

  let index = startIndex + 1
  let level = 1

  do {
    const nextStartPosition = body.indexOf(starting, index)
    const nextClosingPosition = body.indexOf(closing, index)
    // Unbalanced markup: no closing tag left to find
    if (nextClosingPosition === -1) return -1

    // indexOf returns -1 when no further opening tag exists;
    // in that case the closing tag is the next one to handle
    if (nextStartPosition === -1 || nextClosingPosition < nextStartPosition) {
      level -= 1
      index = nextClosingPosition + 1
    } else {
      level += 1
      index = nextStartPosition + 1
    }
  } while (level !== 0)

  return index + 2 + tagName.length // include the end tag in the substring
}
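
One thing to watch with the substring search: `<div` also matches longer tag names such as `<divider`. Below is a minimal sketch of the same depth-counting idea using a global regex that scans from the start tag onward. The name `extractElement` and the sample string are hypothetical. It assumes `tagName` contains no regex special characters and that `tagStart` begins with `<tagName`. It does not handle tags that appear inside comments, scripts, or attribute values.

```javascript
// Hypothetical helper, not part of the answer above.
function extractElement(body, tagStart, tagName) {
  const startIndex = body.indexOf(tagStart);
  if (startIndex === -1) return undefined;

  // Matches "<div" or "</div" only when followed by whitespace, "/" or ">",
  // so "<divider>" is not counted as a "<div>".
  const tagPattern = new RegExp(`<(/?)${tagName}(?=[\\s/>])`, "gi");
  tagPattern.lastIndex = startIndex; // scan from the start tag, not from 0

  let level = 0;
  let match;
  while ((match = tagPattern.exec(body)) !== null) {
    level += match[1] === "/" ? -1 : 1; // closing tag: -1, opening tag: +1
    if (level === 0) {
      const endIndex = body.indexOf(">", match.index);
      return endIndex === -1 ? undefined : body.substring(startIndex, endIndex + 1);
    }
  }
  return undefined; // unbalanced markup
}

const sampleHtml =
  "<body><div class='wrapper'><div class='selector'>keep <div>inner</div>" +
  "<divider>x</divider></div></div></body>";

console.log(extractElement(sampleHtml, "<div class='selector'>", "div"));
```

Only the text from the start tag to its matching end tag is scanned, so the rest of the page is never touched.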
