My Chrome extension scrapes a variety of web pages. I haven't found an approach that fully works yet. Here's what I've tried that comes close:
From the background script, I can fetch, and then run the HTML through htmlparser2 to parse it (I can't get a document, but for simple extraction this is OK). This is fine for static sites, but doesn't work for sites that render content with JavaScript.
I can create a tab with extension-supplied HTML, and in the tab load the targets that I'm attempting to scrape in an iframe (after using declarativeNetRequest to remove X-Frame-Options and related headers). Unfortunately, I then run into same-origin policy, which means that I can't access the content of the iframe - specifically, iframe.contentDocument ends up as null. I tried injecting a script into the iframe using chrome.scripting.executeScript, thinking I could post a message and get it to respond, but I don't have permission to inject scripts on chrome-extension:// tabs, even though it's my own tab! (This seems dumb, but maybe by design.)
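For reference, a minimal sketch of what the header-removal step could look like. It assumes a Manifest V3 extension with the declarativeNetRequest permission and host permissions for the target sites. The helper names (buildHeaderStripRule, installHeaderStripRule), the rule id, the restriction to sub_frame resources, and the choice of Content-Security-Policy as one of the "related headers" are illustrative assumptions, not necessarily the setup described above.

```javascript
// Hypothetical helper: builds one declarativeNetRequest rule that strips
// framing-related response headers from iframe (sub_frame) loads.
function buildHeaderStripRule(ruleId) {
  return {
    id: ruleId,
    priority: 1,
    action: {
      type: "modifyHeaders",
      responseHeaders: [
        { header: "x-frame-options", operation: "remove" },
        // Assumption: CSP (frame-ancestors) counts as a "related header".
        { header: "content-security-policy", operation: "remove" },
      ],
    },
    condition: {
      // Only touch responses loaded into iframes, not top-level navigations.
      resourceTypes: ["sub_frame"],
    },
  };
}

// Registers the rule for the current browser session.
// Returns false when not running inside an extension context.
async function installHeaderStripRule(ruleId = 1) {
  if (typeof chrome === "undefined" || !chrome.declarativeNetRequest) {
    return false;
  }
  await chrome.declarativeNetRequest.updateSessionRules({
    removeRuleIds: [ruleId], // replace any earlier copy of the same rule
    addRules: [buildHeaderStripRule(ruleId)],
  });
  return true;
}
```

Calling installHeaderStripRule() once from the background script would be enough for this sketch. Note that removing these headers only lets the page load in the iframe; it does not change the cross-origin restriction on iframe.contentDocument.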
I know I could create a new tab per URL I want to scrape; however, in order to do that, I'd need a lax contentScripts policy (I have dozens of URLs), and I really don't want to be injecting a contentScript into the user's regular browsing tabs (although I will if I find no other solution). Also, the distraction of tabs showing and hiding, or the favicon / title on the tab changing, is pretty poor UX.
Firefox has hidden tabs, which would be nice, but they're not supported in Chrome.
Is there a cleaner approach?