I want to extract data from an HTML string in a Web Worker.

To be clear, I do not want to manipulate the DOM. I am sending an HTML string to the Web Worker, which should extract data from that HTML and return the extracted data.

In the browser I could do:

  var html = $("<body><div>...more html...</div></body>");

  var extractedText = $(".selector", html).text();

My Question:

What is the equivalent of the above code in a Web Worker, given the same HTML string? There is no jQuery, no DOMParser, and no querySelector in the Web Worker environment. Are there alternatives?

The Why:

I'm doing on-the-fly scraping of pages in the browser and don't want to block the UI thread, because it's pretty heavy work.

I've looked at jsdom, cheerio, etc. but could not figure out how to make them work.

Regarding suggested duplicates:

I have reviewed both of the suggested duplicates and they are ones that I have read before while searching for answers to this question. They address XML parsing and not HTML parsing, and also do not address how to use CSS-selection inside Web Workers.

yehyaawad
  • Possible duplicate of [Parsing XML in a Web Worker](http://stackoverflow.com/questions/10494632/parsing-xml-in-a-web-worker) and [Parsing XML in a Web Worker](http://stackoverflow.com/questions/9133918/parsing-xml-in-web-workers) – Kaiido Mar 31 '16 at 04:28
  • @Kaiido I have checked both, they do not solve this problem. – yehyaawad Mar 31 '16 at 05:12
  • @dandavis querySelector & jQuery's AJAX do not exist in the Web Worker environment, sadly. – yehyaawad Mar 31 '16 at 05:29
  • @dandavis Can you give me the code/docs for that? I've looked endlessly but could not find anything. – yehyaawad Mar 31 '16 at 05:35
  • @dandavis My main problem is extracting information from the HTML, is there something that addresses that? – yehyaawad Mar 31 '16 at 05:38
  • i meant to try specifying that ajax should give you back an HTML document, but i just tried it and it doesn't work inside of workers, only main windows... sorry to waste time. you could do the ajax part in a worker, transfer it to the window to do the CSS selection, then send a big messy array of strings back to the worker for cpu-intensive cleanup. the dom selection part should be fairly quick. – dandavis Mar 31 '16 at 05:47
  • @yehyaawad the main point of both dupes I linked is that **you can't do DOM manipulation in a worker**. They do mention some libraries, which I haven't tested, but we don't do library suggestions here anyway. So the answer to your question is "it's impossible with native APIs; you'll have to use a JS library that does the parsing and querying from scratch, hence there is no native equivalent." But that's intrinsically contained in both dupes, so there is no need to answer your question. – Kaiido Mar 31 '16 at 06:01
  • @Kaiido I'm not trying to change the DOM in any way; I actually don't care whether there is a DOM or not. My question is how I can extract data from this HTML string, which contains data that I need. Can I turn it into XML and search using XPath? Can I parse it in some way? Can I use CSS selectors? Can I use regex? I'm looking for anyone who has dealt with this problem before. I'm not asking for a native API or a JS library, I'm asking for anything that can solve this problem. – yehyaawad Mar 31 '16 at 07:03
  • You were asking for an equivalent of CSS selector matching plus DOM property reading, which are CSSOM/DOM operations and unavailable in workers. Your only option is regular string operations/regex, which [is bad](http://stackoverflow.com/a/1732454/3702797) for markup languages. – Kaiido Mar 31 '16 at 07:10

2 Answers

Short answer:

You cannot do any sort of HTML/CSS manipulation, including querying, with the browser's native APIs in a web worker.

Long answer:

There are many DOMs. There's the main DOM, which is rendered on the page, but everything that a browser does that touches an HTML or XML tree, including querySelector and friends, requires the browser to build a DOM for that tree. (see also: DocumentFragment)

One of Mozilla's developers talked a bit about some reasons why they can't build any DOMs on worker threads (found via this question, on nabble):

You're assuming that none of the DOM implementation code uses any sort of non-DOM objects, ever, or that if it does those objects are fully threadsafe. That's just not the case, at least in Gecko.

The issue in this case is not the same DOM object being touched on multiple threads. The issue is two DOM objects on different threads both touching some global third object.

For example, the XML parser has to do some things that in Gecko can only be done on the main thread (DTD loading, offhand; there are a few others that I've seen before but don't recall offhand).

So. We obviously can't use querySelector, createElement, or anything useful in a worker, so what can we do?

Build our own DOM parser/selector modules, of course!

Not really. Try including a copy of htmlparser2 in your worker, maybe via browserify (making that work is its own question). With that, and with css-select (CSSselect) to allow querySelector-like selecting, you should be ready to go.

Admittedly, you can't use jQuery with those, but for simple querying needs they (with their querySelector/querySelectorAll-style selection) should be more than sufficient.
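
As a rough sketch of the worker-side glue (not a tested setup), the three library calls can be wrapped in one helper that mimics `$(selector, html).text()`. It assumes that htmlparser2 exports `parseDocument` and `DomUtils.textContent`, and that css-select (CSSselect above) exports `selectAll`, which I believe is true of recent versions but is worth checking against the ones you install. `makeExtractor` and `extractText` are made-up names, and the library functions are passed in as an argument so the snippet does not depend on any particular bundler setup.

```javascript
// The three library functions are injected, so this snippet has no imports.
// In a bundled worker they would presumably come from:
//   import { parseDocument, DomUtils } from "htmlparser2";
//   import { selectAll } from "css-select";
//   const lib = { parseDocument, selectAll, textContent: DomUtils.textContent };
function makeExtractor(lib) {
  // Returns a function that behaves roughly like $(selector, html).text():
  // the concatenated text of every element matching the selector.
  return function extractText(html, selector) {
    const document = lib.parseDocument(html); // plain JS tree, no browser DOM
    const matches = lib.selectAll(selector, document);
    return matches.map((node) => lib.textContent(node)).join("");
  };
}

// Hypothetical wiring inside the worker script:
//   const extractText = makeExtractor(lib);
//   self.onmessage = (event) => {
//     const { html, selector } = event.data;
//     self.postMessage(extractText(html, selector));
//   };
```

The main thread would then only call `worker.postMessage({ html, selector })` and read the extracted string from the worker's `message` event.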

Hawken MacKay Rives

You can do DOM selection from inside a worker, but you will need to create an API that uses postMessage to exchange data between the main thread and the worker (because you can't use the DOM directly in a worker). The limitation is that you can only pass serializable data such as strings between them, so you can't return DOM nodes, unless you have some code that creates DOM-like nodes in the worker based on data from the main thread.

Because JavaScript is dynamic, it should be easy to create a dynamic wrapper that generates all those functions for you, so that you can call querySelector('.foo') and have all the DOM APIs exposed. With Proxy objects you can even allow querySelector('.foo').innerHTML = 'hello'; in the worker, given the proper code.
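
Here is a minimal sketch of such a wrapper, under some assumptions: `createRemote` and `serve` are hypothetical names, every call is asynchronous (so it covers method calls that return serializable values, not property assignment like `.innerHTML = ...`), and `endpoint` is anything with `postMessage()` and an assignable `onmessage`, such as `self` in the worker or the `Worker` object on the main thread.

```javascript
// Caller side (would live in the worker).
function createRemote(endpoint) {
  let nextId = 0;
  const pending = new Map(); // call id -> { resolve, reject }

  endpoint.onmessage = (event) => {
    const { id, result, error } = event.data;
    const call = pending.get(id);
    if (!call) return; // not a reply to one of our calls
    pending.delete(id);
    if (error !== undefined) call.reject(new Error(error));
    else call.resolve(result);
  };

  // Reading any property on the proxy yields a function that sends an RPC message.
  return new Proxy({}, {
    get(_target, method) {
      // Symbols cannot be cloned, and "then" must stay undefined so the
      // proxy itself is not mistaken for a promise.
      if (typeof method !== "string" || method === "then") return undefined;
      return (...args) => new Promise((resolve, reject) => {
        const id = nextId++;
        pending.set(id, { resolve, reject });
        endpoint.postMessage({ id, method, args });
      });
    },
  });
}

// Callee side (would live on the main thread, which has the real DOM).
function serve(endpoint, api) {
  endpoint.onmessage = async (event) => {
    const { id, method, args } = event.data;
    try {
      const result = await api[method](...args);
      endpoint.postMessage({ id, result });
    } catch (err) {
      endpoint.postMessage({ id, error: String(err) });
    }
  };
}

// Hypothetical wiring (queryText is an illustrative name):
//   main thread: serve(worker, {
//     queryText: (html, sel) => Array.from(
//       new DOMParser().parseFromString(html, "text/html").querySelectorAll(sel),
//       (el) => el.textContent).join(""),
//   });
//   worker: const dom = createRemote(self);
//           const text = await dom.queryText(html, ".selector");
```

Note that with this design the selection itself still runs on the main thread; only the surrounding work stays in the worker.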

There is a library that makes creating such an API easier: Comlink, from Google. If you don't want to use a library, you can check this code, from this Git web terminal, which exposes isomorphic-git functions to a worker using RPC-like code (it's inspired by Jason Miller's workerize).

A quick search also turns up a promising-looking project, "Worker DOM". It should give you a DOM API in a worker (I'm almost sure it uses the solution I proposed), but I haven't checked it and I'm not sure how it works.

With a bit of work you might even get jQuery working inside a worker; it would be a good project to open-source.

jcubic