Parse an HTML string with JS

Question

I want to parse a string which contains HTML text. I want to do it in JavaScript.

I tried the Pure JavaScript HTML Parser library, but it seems to parse the HTML of my current page rather than the string I give it: when I try the code below, it changes the title of my page:

var parser = new HTMLtoDOM("<html><head><title>titleTest</title></head><body><a href='test0'>test01</a><a href='test1'>test02</a><a href='test2'>test03</a></body></html>", document);

My goal is to extract links from an external HTML page that I have read in as a string.

Do you know an API to do it?

possible duplicate of [JavaScript DOMParser access innerHTML and other properties](http://stackoverflow.com/questions/9250545/javascript-domparser-access-innerhtml-and-other-properties) — Rob W, May 14 '12 at 14:13
The method on the linked duplicate creates a HTML document from a given string. Then, you can use `doc.getElementsByTagName('a')` to read the links (or even [`doc.links`](https://developer.mozilla.org/en/DOM/document.links)). — Rob W, May 14 '12 at 14:15
It's worth mentioning that if you're using a framework like React.js then there may be ways of doing it that are specific to the framework such as: http://stackoverflow.com/questions/23616226/insert-html-with-react-variable-statements-jsx — Mike Lyons, Mar 27 '15 at 22:26
Does this answer your question? [Strip HTML from Text JavaScript](https://stackoverflow.com/questions/822452/strip-html-from-text-javascript) — Leif Arne Storset, Mar 11 '20 at 14:10

score 498 · Accepted Answer · edited May 20 '15 at 17:42

Create a dummy DOM element and add the string to it. Then, you can manipulate it like any DOM element.

var el = document.createElement( 'html' );
el.innerHTML = "<html><head><title>titleTest</title></head><body><a href='test0'>test01</a><a href='test1'>test02</a><a href='test2'>test03</a></body></html>";

el.getElementsByTagName( 'a' ); // Live HTMLCollection of your anchor elements

Edit: adding a jQuery answer to please the fans!

var el = $( '<div></div>' );
el.html("<html><head><title>titleTest</title></head><body><a href='test0'>test01</a><a href='test1'>test02</a><a href='test2'>test03</a></body></html>");

$('a', el) // All the anchor elements

edited May 20 '15 at 17:42 by omninonsense
answered May 14 '12 at 14:14 by Florian Margaine

11

Just a note: with this solution, if I do an "alert(el.innerHTML)", I lose the `<html>`, `<head>` and `<body>` tags... – stage May 14 '12 at 15:10
2

Problem: I need to get links from `<frame>` tags, but with this solution the frame tags are deleted... – stage May 21 '12 at 10:10
You can clone the element and work on the clone. This way, you keep the original untouched and work on the cloned element (which you can delete/whatever). To clone, you can use: `var c = el.cloneNode( true );` or with jQuery: `var c = $( el ).clone();`. – Florian Margaine May 21 '12 at 10:15
I think I didn't understand because when I try it, it doesn't work: var c = el.cloneNode( true ); alert(c.innerHTML); The frame tag is still deleted – stage May 21 '12 at 10:26
It does work in there: http://jsfiddle.net/Ralt/nkPjp/ . If what you want is getting elements from an `iframe` on another domain, then it is not possible for security reasons. – Florian Margaine May 21 '12 at 12:03
I've got this: http://jsfiddle.net/aHWJ8/ and I cannot grab the link. As you can see, even the surrounding tags are deleted. – stage May 21 '12 at 13:28
The link is in the "src" of the frame. – stage May 21 '12 at 13:54
Well, that's completely different from what your question states. You should ask another question for this. – Florian Margaine May 21 '12 at 14:05
But the problem is that you can't do that. Even jQuery will strip off the `frame` tags, since it's just using `innerHTML`. I don't think using frames is a good idea btw. – Florian Margaine May 21 '12 at 14:16
But this is what I asked: "My goal is to extract links from a HTML external page that I read just like a String." I extract links from the frame tags. – stage May 21 '12 at 14:16
1

In an HTML page, a link is an `anchor` tag (a), that's how everybody answered you :-). You can't get the FRAME source. innerHTML is the only way to do this, so you can't do it. Your only way would be to send the html server side with ajax so that you can work with it. – Florian Margaine May 21 '12 at 14:19
1

Thanks for posting an answer that involves vanilla Javascript! Almost in 99.999% of the cases there's no need to use jQuery! Occasionally, I get lazy and use $.get/post, but that's it. – Nick Dec 03 '13 at 06:20
5

@stage I'm a little bit late to the party, but you should be able to use `document.createElement('html');` to preserve the `<head>` and `<body>` tags. – omninonsense May 20 '15 at 17:21
I was afraid for ID collision, but this did not happen. Just in case another newbie was wondering the same thing. – JMRC Nov 20 '15 at 17:31
5

it looks like you are putting an html element within an html element – symbiont Aug 16 '17 at 11:39
In my case, my page needs to repeat this activity over and over again. Would repeatedly creating a dummy dom element get memory intensive? Is there a way to dispose of the dom element once the innerHtml has been extracted? I'm not quite familiar with how the browser handles javascript variables. – Glitch Mar 27 '18 at 03:42
17

I'm concerned this is upvoted as the top answer. The [`parse()`](https://stackoverflow.com/questions/10585029/#55046067) solution below is more reusable and elegant. – Justin Mar 07 '19 at 17:36
6

Security note: this will execute any script in the input, and thus is unsuitable for untrusted input. – Leif Arne Storset Mar 11 '20 at 13:55
2

That's not an ideal solution, since if the html string contains images for example, the browser will try to fetch them! So this is a side effect of the parsing that we may not want: See this example: var html = "
"; var div = document.createElement('div'); div.innerHTML = html; – Nathan B Jan 30 '21 at 22:57
I can imagine that `html` is twice there, it is `created` and then `used` in `innerHtml`. – Timo Mar 31 '21 at 17:58
Doesn't work if you're going to work on the DOM afterwards (at least in Firefox). Doing `el.ownerDocument.getElementById('some-id-with-new-html').innerHTML` returns the old HTML instead of the new. – TheStoryCoder Aug 19 '22 at 10:14
If you don't want it to load images and whatnot, use a DOMParser (see other answers) – jxxe Sep 04 '22 at 17:31

score 404 · Answer 2 · edited Dec 07 '22 at 16:15

It's quite simple:

const parser = new DOMParser();
const htmlDoc = parser.parseFromString(txt, 'text/html');
// do whatever you want with htmlDoc.getElementsByTagName('a');

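According to MDN at the time of writing, older versions of Chrome did not accept 'text/html' here (a comment below puts the change at Chrome 31), so there you had to parse as XML, which only works for well-formed markup: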

const parser = new DOMParser();
const htmlDoc = parser.parseFromString(txt, 'text/xml');
// do whatever you want with htmlDoc.getElementsByTagName('a');

~~It is currently unsupported by webkit and you'd have to follow Florian's answer, and it is unknown to work in most cases on mobile browsers.~~

Edit: Now widely supported
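For the goal stated in the question (pulling the links out of the string), a minimal sketch along these lines may help. The helper name `extractLinks` and its `baseUrl` parameter are made up for illustration and are not part of any API. `baseUrl` is assumed to be the address the HTML string was originally fetched from, because, as a comment below points out, the parsed document otherwise resolves relative links against the current page.

```javascript
// Hypothetical helper (not a built-in): returns the absolute URLs of all
// <a href> elements found in an HTML string.
// `baseUrl` is an assumption: the URL the markup was originally loaded from.
// `parser` defaults to a browser DOMParser; it is a parameter only so that
// another implementation with the same method can be swapped in.
function extractLinks(html, baseUrl, parser = new DOMParser()) {
  const doc = parser.parseFromString(html, 'text/html');
  const links = [];
  for (const a of doc.querySelectorAll('a[href]')) {
    // getAttribute gives the raw attribute value; a.href would already be
    // resolved against the current page's URL.
    const raw = a.getAttribute('href');
    try {
      links.push(new URL(raw, baseUrl).href);
    } catch (e) {
      // Skip values that cannot be parsed as a URL.
    }
  }
  return links;
}
```

With a placeholder base such as `https://example.com/dir/page.html`, an anchor with `href='test0'` would come back as `https://example.com/dir/test0`.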

edited Dec 07 '22 at 16:15 by Luís Soares
answered Feb 19 '14 at 03:28 by Cilan

48

Worth noting that in 2016 DOMParser is now widely supported. http://caniuse.com/#feat=xml-serializer – aendra Mar 09 '16 at 11:21
7

Worth noting that all relative links in the created document are broken, because the document gets created by inheriting the `documentURL` of `window`, which most likely differs from the URL of the string. – ceving Nov 03 '17 at 00:17
3

Worth noting that you should *only* call `new DOMParser` once and then reuse that same object throughout the rest of your script. – Jack G May 19 '18 at 17:36
1

The [`parse()`](https://stackoverflow.com/questions/10585029/parse-an-html-string-with-js#55046067) solution below is more reusable and specific to HTML. This is nice if you need an XML document, however. – Justin Mar 07 '19 at 17:39
How can I display this parsed webpage on a dialog box or something? I was not able to find solution for that – Shariq Musharaf Jun 20 '19 at 09:14
Security note: this will execute without any browser context, so no scripts will run. It should be suitable for untrusted input. – Leif Arne Storset Mar 11 '20 at 13:56
how to convert HTML to string using javascript? – Hardik Mandankaa Jul 08 '20 at 08:29
@HardikMandankaa `html` `IS` a string, so no need to convert. It is already there as string rep. – Timo Mar 31 '21 at 17:53
Since chrome 31, `text/html` is possible. I wonder if there is anybody using this version of chrome or lower.. – Timo Mar 31 '21 at 17:56

score 39 · Answer 3 · edited Jul 09 '21 at 14:28

EDIT: The solution below is only for HTML "fragments", since html, head and body are removed. I guess the solution for this question is DOMParser's parseFromString() method:

const parser = new DOMParser();
// Named doc rather than document so it does not clash with the global document
const doc = parser.parseFromString(html, "text/html");

For HTML fragments, the solutions listed here work for most HTML, but not in certain cases.

For example try parsing <td>Test</td>. This one won't work on the div.innerHTML solution nor DOMParser.prototype.parseFromString nor range.createContextualFragment solution. The td tag goes missing and only the text remains.

Only jQuery handles that case well.

So the future solution (MS Edge 13+) is to use the template tag:

function parseHTML(html) {
    var t = document.createElement('template');
    t.innerHTML = html;
    return t.content;
}

var documentFragment = parseHTML('<td>Test</td>');

For older browsers I have extracted jQuery's parseHTML() method into an independent gist - https://gist.github.com/Munawwar/6e6362dbdf77c7865a99

If you want to write forward-compatible code that also works on old browsers you can [polyfill the ` — Jeff Laughlin, Sep 29 '17 at 17:06

Mathieu · Answer 4 · 2020-03-12T14:30:59.633

29

var doc = new DOMParser().parseFromString(html, "text/html");
var links = doc.querySelectorAll("a");

edited Mar 12 '20 at 14:30
answered May 14 '12 at 14:18 by Mathieu

4

Why are you prefixing `$`? Also, as mentioned in the [linked duplicate](http://stackoverflow.com/questions/9250545/javascript-domparser-access-innerhtml-and-other-properties), `text/html` is not supported very well, and has to be implemented using a polyfill. – Rob W May 15 '12 at 13:08
1

I copied this line from a project. I'm used to prefixing variables with $ in JavaScript applications (not in libraries). It's just to avoid conflicts with a library. That's not very useful, as almost every variable is scoped, but it used to be. It also (maybe) helps to identify variables easily. – Mathieu May 15 '12 at 13:23
1

Sadly, `DOMParser` doesn't work with `text/html` in Chrome either; [this MDN page](https://developer.mozilla.org/en-US/docs/DOM/DOMParser#DOMParser_HTML_extension_for_other_browsers) gives a workaround. – Jokester Apr 24 '13 at 16:51
1

Security note: this will execute without any browser context, so no scripts will run. It should be suitable for untrusted input. – Leif Arne Storset Mar 11 '20 at 13:57

score 7 · Answer 5 · edited Jun 20 '20 at 09:12

The following function parseHTML will return either:

a Document when your file starts with a doctype.
a DocumentFragment when your file doesn't start with a doctype.

The code:

function parseHTML(markup) {
    if (markup.toLowerCase().trim().indexOf('<!doctype') === 0) {
        // Full document: parse into a new, detached HTML document.
        var doc = document.implementation.createHTMLDocument("");
        doc.documentElement.innerHTML = markup;
        return doc;
    }
    var el;
    if ('content' in document.createElement('template')) {
        // Template tag exists!
        el = document.createElement('template');
        el.innerHTML = markup;
        return el.content;
    }
    // Template tag doesn't exist!
    var docfrag = document.createDocumentFragment();
    el = document.createElement('body');
    el.innerHTML = markup;
    // appendChild moves each node out of el, so loop until el is empty.
    while (el.firstChild) {
        docfrag.appendChild(el.firstChild);
    }
    return docfrag;
}

How to use:

var links = parseHTML('<!doctype html><html><head></head><body><a>Link 1</a><a>Link 2</a></body></html>').getElementsByTagName('a');

edited Jun 20 '20 at 09:12 by Community
answered Dec 09 '13 at 03:38 by John Slegers

I couldn't get this to work on IE8. I get the error "Object doesn't support this property or method" for the first line in the function. I don't think the createHTMLDocument function exists – Sebastian Carroll Jan 10 '14 at 06:21
What exactly is your use case? If you just want to parse HTML and your HTML is intended for the body of your document, you could do the following : (1) var div=document.createElement("DIV"); (2) div.innerHTML = markup; (3) result = div.childNodes; --- This gives you a collection of childnodes and should work not just in IE8 but even in IE6-7. – John Slegers Jan 14 '14 at 15:03
Thanks for the alternate option, I'll try it if I need to do this again. For now though I used the JQuery solution above. – Sebastian Carroll Jan 22 '14 at 22:04
@SebastianCarroll Note that IE8 doesn't support the `trim` method on strings. See http://stackoverflow.com/q/2308134/3210837. – Toothbrush Dec 24 '16 at 21:02
3

@Toothbrush : Is IE8 support still relevant at the dawn of 2017? – John Slegers Dec 29 '16 at 14:53
@JohnSlegers For some companies, yes. – Toothbrush Dec 29 '16 at 15:36

Joel · Answer 6 · 2015-02-08T05:15:57.797

7

The fastest way to parse HTML in Chrome and Firefox is Range#createContextualFragment:

var range = document.createRange();
range.selectNode(document.body); // required in Safari
var fragment = range.createContextualFragment('<h1>html...</h1>');
var firstNode = fragment.firstChild;

I would recommend creating a helper function which uses createContextualFragment if available and falls back to innerHTML otherwise.
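The helper suggested above might look something like the following. It is only a sketch: the name `parseFragment` and the optional `doc` parameter are invented for illustration, and moving nodes out of a temporary `<div>` is just one possible fallback.

```javascript
// Hypothetical helper: parse an HTML string into a DocumentFragment.
// Uses Range#createContextualFragment when available, otherwise innerHTML.
// `doc` defaults to the page's document.
function parseFragment(html, doc = document) {
  const range = typeof doc.createRange === 'function' ? doc.createRange() : null;
  if (range && typeof range.createContextualFragment === 'function') {
    range.selectNode(doc.body); // the snippet above marks this as required in Safari
    return range.createContextualFragment(html);
  }
  // Fallback: let a temporary element parse the string, then move its children.
  const container = doc.createElement('div');
  container.innerHTML = html;
  const fragment = doc.createDocumentFragment();
  while (container.firstChild) {
    // appendChild moves the node, so container eventually becomes empty.
    fragment.appendChild(container.firstChild);
  }
  return fragment;
}
```

As the comments on this answer point out, neither path is suitable for untrusted input, so a helper like this is best kept for markup you control.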

Benchmark: http://jsperf.com/domparser-vs-createelement-innerhtml/3

edited Feb 08 '15 at 05:15
answered Feb 08 '15 at 04:41 by Joel

Note that, like (the simple) `innerHTML`, this will execute an `<img>`’s `onerror`. – Ry- Aug 28 '15 at 22:54
1

An issue with this is that HTML like '<td>test</td>' would ignore the td in the document.body context (and only create a 'test' text node). OTOH, if it is used internally in a templating engine, then the right context would be available. – Munawwar Oct 05 '15 at 21:47
Also BTW, IE 11 supports createContextualFragment. – Munawwar Oct 05 '15 at 21:49
The question was how to parse with JS - not Chrome or Firefox – sea26.2 Apr 19 '19 at 01:38
4

Security note: this will execute any script in the input, and thus is unsuitable for untrusted input. – Leif Arne Storset Mar 11 '20 at 13:55

anthumchris · Answer 7 · 2019-07-21T21:33:07.033

const parse = Range.prototype.createContextualFragment.bind(document.createRange());

document.body.appendChild( parse('<p><strong>Today is:</strong></p>') );
document.body.appendChild( parse(`<p style="background: #eee">${new Date()}</p>`) );

Only valid child Nodes within the parent Node (start of the Range) will be parsed. Otherwise, unexpected results may occur:

// <body> is "parent" Node, start of Range
const parseRange = document.createRange();
const parse = Range.prototype.createContextualFragment.bind(parseRange);

// Returns Text "1 2" because td, tr, tbody are not valid children of <body>
parse('<td>1</td> <td>2</td>');
parse('<tr><td>1</td> <td>2</td></tr>');
parse('<tbody><tr><td>1</td> <td>2</td></tr></tbody>');

// Returns <table>, which is a valid child of <body>
parse('<table> <td>1</td> <td>2</td> </table>');
parse('<table> <tr> <td>1</td> <td>2</td> </tr> </table>');
parse('<table> <tbody> <td>1</td> <td>2</td> </tbody> </table>');

// <tr> is parent Node, start of Range
parseRange.setStart(document.createElement('tr'), 0);

// Returns [<td>, <td>] element array
parse('<td>1</td> <td>2</td>');
parse('<tr> <td>1</td> <td>2</td> </tr>');
parse('<tbody> <td>1</td> <td>2</td> </tbody>');
parse('<table> <td>1</td> <td>2</td> </table>');

Security note: this will execute any script in the input, and thus is unsuitable for untrusted input. — Leif Arne Storset, Mar 11 '20 at 13:55

score 6 · Answer 8 · answered Dec 09 '21 at 19:50

I think the best way is to use the DOMParser API like this:

//Table string in HTML format
const htmlString = '<table><tbody><tr><td>Cell 1</td><td>Cell 2</td></tr></tbody></table>';

//Parse using DOMParser native way
const parser = new DOMParser();
const $newTable = parser.parseFromString(htmlString, 'text/html');

//Here you can select parts of your parsed html and work with it
const $row = $newTable.querySelector('table > tbody > tr');

//Here I'm printing the number of columns (2)
const $containerHtml = document.getElementById('containerHtml');
$containerHtml.innerHTML = ['Your parsed table has', $row.cells.length, 'columns.'].join(' ');

<div id="containerHtml"></div>

Юрий Светлов · Answer 9 · 2020-11-29T06:02:59.140

1 Way

Use document.cloneNode()

Performance is:

Call to document.cloneNode() took ~0.22499999977299012 milliseconds.

and may well be more.

var t0, t1, html;

t0 = performance.now();
   html = document.cloneNode(true);
t1 = performance.now();

console.log("Call to doSomething took " + (t1 - t0) + " milliseconds.")

html.documentElement.innerHTML = '<!DOCTYPE html><html><head><title>Test</title></head><body><div id="test1">test1</div></body></html>';

console.log(html.getElementById("test1"));

2 Way

Use document.implementation.createHTMLDocument()

Performance is:

Call to document.implementation.createHTMLDocument() took ~0.14000000010128133 milliseconds.

var t0, t1, html;

t0 = performance.now();
html = document.implementation.createHTMLDocument("test");
t1 = performance.now();

console.log("Call to doSomething took " + (t1 - t0) + " milliseconds.")

html.documentElement.innerHTML = '<!DOCTYPE html><html><head><title>Test</title></head><body><div id="test1">test1</div></body></html>';

console.log(html.getElementById("test1"));

3 Way

Use document.implementation.createDocument()

Performance is:

Call to document.implementation.createHTMLDocument() took ~0.14000000010128133 milliseconds.

var t0 = performance.now();
// Declared with var so that html is not created as an implicit global
var html = document.implementation.createDocument('', 'html',
             document.implementation.createDocumentType('html', '', '')
         );
var t1 = performance.now();

console.log("Call to doSomething took " + (t1 - t0) + " milliseconds.")

html.documentElement.innerHTML = '<html><head><title>Test</title></head><body><div id="test1">test</div></body></html>';

console.log(html.getElementById("test1"));

4 Way

Use new Document()

Performance is:

Call to new Document() took ~0.13499999840860255 milliseconds.

Note

ParentNode.append was marked as experimental technology as of 2020.

var t0, t1, html;

t0 = performance.now();
//---------------
html = new Document();

html.append(
  html.implementation.createDocumentType('html', '', '')
);
    
html.append(
  html.createElement('html')
);
//---------------
t1 = performance.now();

console.log("Call to doSomething took " + (t1 - t0) + " milliseconds.")

html.documentElement.innerHTML = '<html><head><title>Test</title></head><body><div id="test1">test1</div></body></html>';

console.log(html.getElementById("test1"));

score 5 · Answer 10 · answered Sep 16 '21 at 04:45

To do this in Node.js, you can use an HTML parser like node-html-parser. The syntax looks like this:

import { parse } from 'node-html-parser';

const root = parse('<ul id="list"><li>Hello World</li></ul>');

console.log(root.firstChild.structure);
// ul#list
//   li
//     #text

console.log(root.querySelector('#list'));
// { tagName: 'ul',
//   rawAttrs: 'id="list"',
//   childNodes:
//    [ { tagName: 'li',
//        rawAttrs: '',
//        childNodes: [Object],
//        classNames: [] } ],
//   id: 'list',
//   classNames: [] }
console.log(root.toString());
// <ul id="list"><li>Hello World</li></ul>
root.set_content('<li>Hello World</li>');
root.toString();    // <li>Hello World</li>

answered Sep 16 '21 at 04:45 by Daniel Kaplan

1

This is the best solution even in the browser, if you do not want to rely on the browser implementation. It will always behave the same no matter which browser you are on (not that it matters much nowadays), and the parsing is done in JavaScript itself instead of C/C++! – Rainb Jul 26 '22 at 09:50
Thanks @Rainb. How do you use the solution in the browser though? – Daniel Kaplan Jul 26 '22 at 19:59
1
like this `(await import("https://cdn.skypack.dev/node-html-parser")).default('<ul id="list"><li>Hello World</li></ul>').firstChild.structure` – Rainb Jul 27 '22 at 16:10
I never knew that was an option. Can you do that with any node library, or is it because this one doesn't use any node-only code? – Daniel Kaplan Jul 29 '22 at 00:49
1

if it requires anything from node like tls, http, net, fs then it probably won't work in the browser. But it won't work in deno either. So just look for deno compatible packages. – Rainb Jul 29 '22 at 07:47

score 4 · Answer 11 · answered May 14 '12 at 14:17

If you're open to using jQuery, it has some nice facilities for creating detached DOM elements from strings of HTML. These can then be queried through the usual means, e.g.:

var html = "<html><head><title>titleTest</title></head><body><a href='test0'>test01</a><a href='test1'>test02</a><a href='test2'>test03</a></body></html>";
var anchors = $('<div/>').append(html).find('a').get();

Edit - just saw @Florian's answer which is correct. This is basically exactly what he said, but with jQuery.

score 1 · Answer 12 · answered Oct 14 '22 at 21:14

const html =
`<script>
    alert(' there ! Wanna grab a '); 
</script>`;

const scriptEl = document.createRange().createContextualFragment(html);
parent.append(scriptEl);

I found this solution, and I think it's the best one: it parses the HTML and executes the script inside.

score 0 · Answer 13 · edited Aug 26 '21 at 18:23

I had to use the innerHTML of a parsed element in an Angular NGX Bootstrap popover. This is the solution that worked for me.

public htmlContainer = document.createElement( 'html' );

in constructor

this.htmlContainer.innerHTML = ''; setTimeout(() => { this.convertToArray(); });

 convertToArray() {
    const shapesHC = document.getElementsByClassName('weekPopUpDummy');
    const shapesArrHCSpread = [...(shapesHC as any)];
    this.htmlContainer = shapesArrHCSpread[0];
    this.htmlContainer.innerHTML = shapesArrHCSpread[0].textContent;
  }

in html

<div class="weekPopUpDummy" [popover]="htmlContainer.innerHTML" [adaptivePosition]="false" placement="top" [outsideClick]="true" #popOverHide="bs-popover" [delay]="150" (onHidden)="onHidden(weekEvent)" (onShown)="onShown()">

score 0 · Answer 14 · answered Dec 29 '21 at 13:12

function parseElement(raw){
    let el = document.createElement('div');
    el.innerHTML = raw;
    let res = el.querySelector('*');
    res.remove();
    return res;
}

Note: the raw string should contain no more than one root element.
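If that restriction should be enforced rather than assumed, a stricter variant could look roughly like this. The name `parseSingleElement` and the optional `doc` parameter are illustrative only, and the sketch assumes the browser supports `<template>`.

```javascript
// Hypothetical stricter variant: returns the single root element of `raw`,
// or throws if the markup has zero or several top-level elements.
// Assumes <template> support; `doc` defaults to the page's document.
function parseSingleElement(raw, doc = document) {
  const template = doc.createElement('template');
  template.innerHTML = raw.trim();
  const content = template.content;
  if (content.childElementCount !== 1) {
    throw new Error('Expected exactly one root element, found ' + content.childElementCount);
  }
  const el = content.firstElementChild;
  content.removeChild(el); // detach, so the caller gets a parentless element
  return el;
}
```

Only elements are counted here, so stray text next to the root element would still pass unnoticed.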

answered Dec 29 '21 at 13:12 by Weilory

score -1 · Answer 15 · answered Oct 07 '20 at 10:53

let content = "<center><h1>404 Not Found</h1></center>"
let result = $("<div/>").html(content).text()

content: <center><h1>404 Not Found</h1></center>,
result: "404 Not Found"

answered Oct 07 '20 at 10:53 by Den Nikitin

This does not answer the question. OP wants to extract links. – Rene Koch Oct 07 '20 at 11:47
