Parsing poorly written HTML in JavaScript?

Question

I'm making a bookmarklet for a website that will need to parse multiple pages. I tried DOMParser, but it gives an error with the XML option and returns null with the HTML option. I tried jQuery, but I'm sure that it uses DOMParser somewhere along the way. Parsing does work correctly with PHP, but I'd rather not have to make twice as many requests to the webpages.

I'm looking for a standalone JavaScript plugin to parse XML or HTML.

Thanks!

See - http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 — ChrisF, Feb 22 '12 at 23:07
@ChrisF, I'm not looking for Regex, just a parser. I haven't the slightest clue what happens behind the scenes in PHP or Javascript's built-in parsers, but are you saying it's not possible without regex, and not practical with it? — mowwwalker, Feb 23 '12 at 00:01

DG. · Accepted Answer · 2012-02-26T02:03:00.020


Can you not just "parse" the HTML by using the DOM?

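If you need to process multiple pages from the current page, load the other pages in an iframe and then access the DOM with something like document.getElementsByTagName('iframe')[0].contentWindow.document (this only works when the framed page is on the same origin as the current page).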
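The iframe approach could be sketched roughly as follows. This is a minimal illustration and not code from the answer: the function name `loadPageInIframe`, its parameters, and the placeholder URL are all assumptions, and it only works when the target page is same-origin with the page running the bookmarklet.

```javascript
// Sketch only: `loadPageInIframe` and its parameters are hypothetical names.
// `doc` is the current page's document, `url` a same-origin page to load.
// `callback` receives the loaded page's document and the iframe element.
function loadPageInIframe(doc, url, callback) {
  var frame = doc.createElement('iframe');
  frame.style.display = 'none'; // keep the helper frame out of sight

  // Register the handler before the frame starts loading.
  frame.onload = function () {
    // Cross-origin pages would make this property access fail.
    callback(frame.contentWindow.document, frame);
  };

  frame.src = url;
  doc.body.appendChild(frame); // loading starts once the frame is attached
  return frame;
}

// Usage in a browser (the URL is a placeholder):
// loadPageInIframe(document, '/some/other/page', function (pageDoc, frame) {
//   console.log(pageDoc.title);
//   frame.parentNode.removeChild(frame); // discard the frame when done
// });
```

Because the iframe loads the page fully, its images, stylesheets, and scripts are fetched and run as well, which is the cost discussed in the comments.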

EDIT: If you wish to avoid loading the external files in the other pages, and also executing their scripts, then use Ajax (XMLHttpRequest) to get each page. For each page, use code like `var newdiv = document.createElement('div'); newdiv.innerHTML = ajaxcontent;` and then use the DOM to read content from newdiv. If you don't append newdiv to the page, this should be just as lightweight as using DOMParser.
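A minimal sketch of the Ajax variant, assuming a same-origin URL. The names `fetchAndParse` and `parseDetached` and the error-first callback shape are illustrative choices, not part of the answer. Scripts inserted through innerHTML are not executed, but browsers may still request images referenced by the markup even while the container is detached.

```javascript
// Sketch only: `parseDetached` and `fetchAndParse` are hypothetical names.

// Let the browser's lenient HTML parser handle the markup by assigning it
// to a container that is never appended to the page.
function parseDetached(doc, html) {
  var container = doc.createElement('div');
  container.innerHTML = html;
  return container;
}

// Fetch `url` with XMLHttpRequest and pass (error, container) to `callback`.
function fetchAndParse(doc, url, callback) {
  var xhr = new XMLHttpRequest();
  xhr.open('GET', url, true); // asynchronous request
  xhr.onreadystatechange = function () {
    if (xhr.readyState !== 4) { return; } // not finished yet
    if (xhr.status !== 200) {
      callback(new Error('HTTP ' + xhr.status), null);
      return;
    }
    callback(null, parseDetached(doc, xhr.responseText));
  };
  xhr.send(null);
}

// Usage in a browser (the URL is a placeholder):
// fetchAndParse(document, '/some/other/page', function (err, container) {
//   if (err) { return; }
//   var links = container.getElementsByTagName('a');
//   console.log(links.length);
// });
```

Note that a div cannot hold html, head, or body elements, so the browser drops those wrappers and keeps their contents as children of the container.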

edited Feb 26 '12 at 02:03

answered Feb 25 '12 at 17:16

DG.


No, this needs to be done for multiple pages and would be very resource heavy. – mowwwalker Feb 25 '12 at 19:01
Please elaborate on what it is about my suggested solution that makes it less appropriate than your solution. The only downside I see compared with your suggested solution is that loading the other pages will also load their external files and scripts. In that case, see my updated answer after "EDIT". – DG. Feb 26 '12 at 02:03
... Loading the content into the DOM would mean loading all the pictures, all the media, etc. – mowwwalker Feb 26 '12 at 06:57
Ah. Seems you are correct. I thought that if you did not append newdiv to the page, the images and other external scripts would not be loaded. You could use regex to strip out all `src` attributes before inserting the content into newdiv. A bit messy, but it should work, and it keeps things fast. The simplest way might be something like `ajaxcontent.replace(/ src\=/gi, ' xsrc=')` – DG. Feb 26 '12 at 10:18
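The renaming trick from the comment above can be wrapped in a small helper. This is a sketch under assumptions: the name `neutralizeSrc` is hypothetical, and the pattern is a slight variation on the one in the comment so that it also tolerates a line break before `src` and spaces around `=`. Being a plain text replacement, it also rewrites matching text inside inline scripts or page copy, and it leaves other resource-loading attributes (such as `srcset` or stylesheet links) untouched.

```javascript
// Sketch only: `neutralizeSrc` is a hypothetical helper name.
// Renames every src attribute to xsrc so that the browser has nothing to
// fetch when the markup is later assigned to innerHTML.
function neutralizeSrc(html) {
  // (\s) keeps the whitespace before the attribute; (\s*=) keeps the
  // optional spacing and the equals sign after it.
  return html.replace(/(\s)src(\s*=)/gi, '$1xsrc$2');
}

// Usage in a browser, with `ajaxcontent` holding the fetched markup:
// var newdiv = document.createElement('div');
// newdiv.innerHTML = neutralizeSrc(ajaxcontent);
// var firstImage = newdiv.getElementsByTagName('img')[0];
// var originalUrl = firstImage && firstImage.getAttribute('xsrc');
```

The original URLs remain readable through the renamed `xsrc` attribute, so nothing is lost for scraping purposes.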
