Clean way to scrape web pages from Manifest V3 chrome extension

Question

My Chrome extension scrapes a variety of web pages. I haven't found an approach that fully works yet. Here is what I've tried that comes close:

From the background script, I can fetch the page and then run the HTML through htmlparser2 to parse it (I can't get a document, but for simple extraction this is OK). This is fine for static sites, but doesn't work for sites that render content with JavaScript.
I can create a tab with extension-supplied HTML, and in that tab load the targets I'm attempting to scrape in an iframe (after using declarativeNetRequest to remove X-Frame-Options and related headers). Unfortunately, I then run into the same-origin policy, which means that I can't access the content of the iframe: specifically, iframe.contentDocument ends up as null. I tried injecting a script into the iframe using chrome.scripting.executeScript, thinking I could post a message and get it to respond, but I don't have permission to inject scripts on chrome-extension:// tabs, even though it's my own tab! (This seems dumb, but maybe it's by design.)

I know I could create a new tab per url I want to scrape; however, in order to do that, I'd need a lax contentScripts policy (I have dozens of urls), and I really don't want to be injecting a contentScript into the user's regular browsing tabs (although I will if I find no other solution). Also, the distraction of tabs showing and hiding, or the favicon / title on the tab changing, is pretty poor UX.

Firefox has hidden tabs, which would be nice, but they're not supported in Chrome.

Is there a cleaner approach?

score 3 · Accepted Answer · answered May 17 '23 at 05:04

Use the chrome.offscreen API to create a hidden document with access to the DOM
Add a declarativeNetRequest rule to strip X-Frame-Options
For each site:
1. register a content script that matches the URL of the site using chrome.scripting.registerContentScripts with allFrames: true and persistAcrossSessions: false
2. in the offscreen document create an iframe inside pointing to the site
3. process its DOM inside your content script
4. send the results back via messaging
5. in the offscreen document remove the iframe
6. unregister the content script
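
A rough sketch of the service-worker side of these steps, under several assumptions: the file names `offscreen.html` and `scraper.js`, the rule id, the `open-frame` / `close-frame` message shapes, and the `scrape-` id prefix are all hypothetical, and the match pattern assumes Chrome compares the pattern's path against the path plus query string. Removing the whole `content-security-policy` header is also an assumption about what "related headers" needs to cover. The `api` parameter only exists so the functions can be exercised outside the browser.

```javascript
// Service-worker side. File names, the rule id, and message shapes are hypothetical.
const OFFSCREEN_PAGE = 'offscreen.html';
const CONTENT_SCRIPT = 'scraper.js';
const STRIP_RULE_ID = 1;

// Session rule that removes response headers which commonly forbid framing.
async function installStripRule(api = globalThis.chrome) {
  await api.declarativeNetRequest.updateSessionRules({
    removeRuleIds: [STRIP_RULE_ID],
    addRules: [{
      id: STRIP_RULE_ID,
      action: {
        type: 'modifyHeaders',
        responseHeaders: [
          { header: 'x-frame-options', operation: 'remove' },
          // Assumption: dropping the whole CSP header is acceptable here.
          { header: 'content-security-policy', operation: 'remove' },
        ],
      },
      condition: { resourceTypes: ['sub_frame'] }, // iframes only
    }],
  });
}

// Create the offscreen document unless one already exists (getContexts: Chrome 116+).
async function ensureOffscreen(api) {
  const open = await api.runtime.getContexts({ contextTypes: ['OFFSCREEN_DOCUMENT'] });
  if (open.length === 0) {
    await api.offscreen.createDocument({
      url: OFFSCREEN_PAGE,
      reasons: ['DOM_SCRAPING'],
      justification: 'Load pages in iframes and read their DOM',
    });
  }
}

// Resolve with the data that the content script reports for this scrape id.
function waitForResult(api, id) {
  return new Promise((resolve) => {
    const listener = (msg) => {
      if (msg && msg.scrapeId === id) {
        api.runtime.onMessage.removeListener(listener);
        resolve(msg.data);
      }
    };
    api.runtime.onMessage.addListener(listener);
  });
}

async function scrapeOne(url, api = globalThis.chrome) {
  const id = 'scrape-' + Math.random().toString(36).slice(2);
  const target = new URL(url);
  target.searchParams.set(id, ''); // dummy parameter that identifies this scrape
  await ensureOffscreen(api);
  // Step 1: the script only matches URLs that carry the dummy parameter.
  await api.scripting.registerContentScripts([{
    id,
    matches: [`${target.protocol}//${target.hostname}/*${id}=*`],
    js: [CONTENT_SCRIPT],
    allFrames: true,
    persistAcrossSessions: false,
  }]);
  try {
    const result = waitForResult(api, id);
    // Step 2: ask the offscreen document to create the iframe.
    await api.runtime.sendMessage({ cmd: 'open-frame', id, url: target.href });
    // Steps 3 and 4 happen in the content script; wait for its message.
    return await result;
  } finally {
    // Steps 5 and 6: remove the iframe, then unregister the content script.
    await api.runtime.sendMessage({ cmd: 'close-frame', id }).catch(() => {});
    await api.scripting.unregisterContentScripts({ ids: [id] });
  }
}
```

In the service worker this would be used as `await installStripRule(); const data = await scrapeOne(someUrl);`. The sketch has no timeout, so a page that never loads would leave `scrapeOne` waiting; a real implementation would probably want one.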

To make the content script run only inside your iframe:

Add a dummy random id to the URL and use it when registering the content script
```
// Append a random dummy parameter so the URL is unique to this scrape.
let u = new URL(url);
u.searchParams.set(String(Math.random()), '');
url = u.href;
```
Theoretically an unknown parameter may be rejected by some site but it's unlikely.

Wrap the entire content script in a condition:

if (location.ancestorOrigins.contains(chrome.runtime.getURL('').slice(0, -1))) {
   .....
}
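
A slightly fuller sketch of the same guard, written as functions so it can be tried outside the browser. The `scrape-` parameter prefix, the message shape `{ scrapeId, data }`, and sending `document.title` as the scraped data are all placeholders, not something the approach requires.

```javascript
// Content-script side (hypothetical scraper.js).

// True only when one of the ancestor frames belongs to this extension.
function isInsideOwnExtensionFrame(loc, api) {
  const extensionOrigin = api.runtime.getURL('').slice(0, -1); // drop trailing '/'
  // ancestorOrigins is an array-like DOMStringList in Chrome.
  return Array.from(loc.ancestorOrigins ?? []).includes(extensionOrigin);
}

// Find the dummy parameter that was appended to the URL, if any.
// Assumption: the parameter name starts with 'scrape-'.
function findScrapeId(loc) {
  for (const key of new URLSearchParams(loc.search).keys()) {
    if (key.startsWith('scrape-')) return key;
  }
  return null;
}

// Extract data and report it, but only inside the extension's own iframe.
function runScraper(loc, api, doc) {
  if (!isInsideOwnExtensionFrame(loc, api)) return false;
  // Placeholder extraction: a real script would read whatever it needs from doc.
  api.runtime.sendMessage({ scrapeId: findScrapeId(loc), data: doc.title });
  return true;
}

// Only runs automatically when loaded as a content script.
if (globalThis.chrome?.runtime && globalThis.document) {
  runScraper(globalThis.location, globalThis.chrome, globalThis.document);
}
```

With this guard in place, the script does nothing when the same URL happens to be open in one of the user's regular tabs.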

wOxxOm

@wOxxOm - thanks - I was looking at the offscreen API but couldn't figure out how to use it. How do you get a handle on the offscreen document? – user717847 May 18 '23 at 10:10
There's no handle. You can use chrome.runtime messaging or navigator.serviceWorker messaging to control the document's behavior. There are [multiple examples around](https://stackoverflow.com/search?q=chrome.offscreen.createDocument+is%3Aanswer), so if those aren't helpful you can ask a question about a specific problem with the API. – wOxxOm May 18 '23 at 10:29
@wOxxOm Thanks for these steps, I've tried implementing them and I think I've gotten it mostly working, although it seems very hard to debug as the console.logs don't seem to go anywhere for offscreen and iframe. I'm assuming that this method doesn't help you get past the problem where the targeted site is detecting if it's loading in an iframe and blocking that? (I'm getting ERR_BLOCKED_BY_RESPONSE returned). – Ivan Jun 07 '23 at 13:27
You can debug the offscreen document in chrome://extensions inside the details page of your extension. – wOxxOm Jun 07 '23 at 16:34
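
Since there is no handle on the offscreen document, controlling it through chrome.runtime messaging might look roughly like this on the offscreen document's side. The file name `offscreen.js` and the `open-frame` / `close-frame` commands are hypothetical; any message shape would work as long as both sides agree on it.

```javascript
// Offscreen-document side (hypothetical offscreen.js loaded by the offscreen page).
const frames = new Map(); // scrape id -> iframe element

// Returns true when the message was a command meant for this document.
function handleCommand(msg, doc = globalThis.document) {
  if (msg.cmd === 'open-frame') {
    const iframe = doc.createElement('iframe');
    iframe.src = msg.url;
    doc.body.append(iframe);
    frames.set(msg.id, iframe);
    return true;
  }
  if (msg.cmd === 'close-frame') {
    frames.get(msg.id)?.remove();
    frames.delete(msg.id);
    return true;
  }
  return false; // e.g. a result message addressed to the service worker
}

// Only registers the listener when running inside the extension.
if (globalThis.chrome?.runtime && globalThis.document) {
  chrome.runtime.onMessage.addListener((msg, sender, sendResponse) => {
    if (handleCommand(msg)) sendResponse(true);
  });
}
```

The service worker would then send `{ cmd: 'open-frame', id, url }` with chrome.runtime.sendMessage to start a scrape and `{ cmd: 'close-frame', id }` once the content script has reported back.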
