The best way to parse HTML is to use the DOM. If all you have is a string of HTML, you can (as this Stack Overflow member suggests) create a "dummy" DOM element and assign the string to its innerHTML, which lets you work with the markup through the DOM, as follows:
var el = document.createElement( 'html' );
el.innerHTML = "<html><head><title>aTitle</title></head>" +
"<body><div><h2>This is a heading1</h2><h2>This is a heading2</h2></div>" +
"</body></html>";
Now you have a couple of ways to access the data using the DOM, as follows:
var el = document.createElement( 'html' );
el.innerHTML = "<html><head><title>aTitle</title></head><body><div><h2>This is a heading1</h2><h2>This is a heading2</h2></div></body></html>";
// one way
var h2s = el.getElementsByTagName("h2");
for (var i = 0, max = h2s.length; i < max; i++) {
  console.log(h2s[i].textContent);
}
console.log("\n"); // blank separator before the second listing
// and another
var elementList = el.querySelectorAll("h2");
for (var i = 0, max = elementList.length; i < max; i++) {
  console.log(elementList[i].textContent);
}
You may also use a regular expression, as follows:
var str = '<div><h2>This is a heading1</h2><h2>This is a heading2</h2></div>';
var re = /<h2>([^<]*?)<\/h2>/g;
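var match;
var m = [];
// exec() returns null once no further match is found, which ends the loop
while ((match = re.exec(str)) !== null) {
  m.push(match.pop()); // the last element is the captured heading text
}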
console.log(m);
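The regex consists of an opening H2 tag, followed by any run of characters other than "<", followed by a closing H2 tag. The parentheses capture that run. The "*?" quantifier matches zero or more such characters, as few as possible (it is lazy). Because the run cannot contain "<", the lazy and greedy forms match the same text here.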
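To see what the pattern accepts and rejects, here is a small sketch that runs it, without the "g" flag, against three made-up inputs. The variable names and sample strings are assumptions for illustration only.

```javascript
// Sketch: probe the pattern with a few hypothetical inputs.
var pattern = /<h2>([^<]*?)<\/h2>/;

var plain = pattern.exec('<h2>Title</h2>');
var empty = pattern.exec('<h2></h2>');
var nested = pattern.exec('<h2>A <em>bold</em> title</h2>');

console.log(plain[1]);  // the text between the tags
console.log(empty[1]);  // an empty string: "*" allows zero characters
console.log(nested);    // null: the inner "<" stops [^<] before </h2> is reached
```

The last case is why the DOM approach is preferable: a heading that contains other markup is silently skipped by this regex.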
Per Ryan of Stack Overflow:
exec with a global regular expression is meant to be used in a loop,
as it will still retrieve all matched subexpressions.
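The critical part of the regex is the "g" flag, as per MDN. It makes exec() resume from the end of the previous match (tracked in the regex's lastIndex property), so repeated calls walk through every match in the string. In each loop iteration, match is an array of two elements: the full matched text and the captured group. pop() removes the last element, which is the captured text, and that value is pushed onto m, so m ultimately contains all the captured text values.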
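The following sketch inspects a single exec() result to show the two elements and the lastIndex bookkeeping. It then collects the same captures with String.prototype.matchAll, an ES2020 alternative that assumes a reasonably modern browser or Node.js. The names first and captured are illustrative.

```javascript
// Sketch: inspect what exec() returns for a global regex.
var str = '<div><h2>This is a heading1</h2><h2>This is a heading2</h2></div>';
var re = /<h2>([^<]*?)<\/h2>/g;

var first = re.exec(str);
console.log(first[0]);      // the full match, including the tags
console.log(first[1]);      // the captured text only
console.log(re.lastIndex);  // where the next exec() call will resume

// matchAll() requires the "g" flag and yields one match array per heading.
var captured = Array.from(str.matchAll(/<h2>([^<]*?)<\/h2>/g), function (m) {
  return m[1]; // keep only the capture group
});
console.log(captured);
```

Reading match[1] instead of calling pop() gives the same value for this pattern and stays correct if more capture groups are added later.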