
I want to create a tampermonkey script that is registered on one page (call it A). From this page (it is an overview page), it extracts a series of links (say [B, C, D]). This is working so far.

Now, I want to do the following:

  1. Navigate to location B.
  2. Wait for the DOM to become ready, so I can extract further information
  3. Parse some information from the page and store it in some object/array (call it out).
  4. Repeat steps 1 through 3 with the URLs C and D
  5. Go back to address A
  6. Copy the content of out to the clipboard

Task 1 I can achieve with window.open or window.location, but I am currently failing at steps 2 and 3.

Is this even possible? I am unsure whether navigating to another page will terminate and unload the currently running script.

Can you point me in the right direction to solve this issue?

If you have a better idea, I am willing to hear it. The reason I am using the browser with Tampermonkey is that the pages use some sort of CSRF protection mechanism that will not allow me to use e.g. curl to extract the relevant data.

I have seen this answer. As far as I understand it, it starts a new script instance on each page load, and I would have to pass all information manually via URL parameters. That might be doable (unless the server mangles the parameters), but it seems like quite some effort. Is there a simpler solution?


1 Answer


To transfer information, there are a few options.

  • URL parameters, as you mentioned - but that could get messy
  • Save the values and a flag in Tampermonkey's shared storage using GM_setValue (see the sketch after this list)
  • If you open the windows to scrape using window.open, you can have the child windows call .postMessage while the parent window listens for messages (including for those from other domains). (BroadcastChannel is a nice flexible option, but it's probably overkill here)
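For the GM_setValue route, a minimal sketch might look like the following. The key names (scrapeResults, scrapeDone), the polling interval, and the JSON-string storage format are just illustrative choices; the script would need @grant GM_setValue and @grant GM_getValue.

    // Runs on B/C/D: store the extracted data and raise a flag for page A.
    // Values are JSON-stringified to keep the storage format simple.
    function storeResult(data) {
        const results = JSON.parse(GM_getValue('scrapeResults', '[]'));
        results.push(data);
        GM_setValue('scrapeResults', JSON.stringify(results));
        GM_setValue('scrapeDone', 'true'); // signal that this page is finished
    }

    // Runs on A: poll the shared storage until the flag appears.
    function waitForResult(callback) {
        const timer = setInterval(() => {
            if (GM_getValue('scrapeDone', 'false') === 'true') {
                clearInterval(timer);
                GM_setValue('scrapeDone', 'false'); // reset for the next link
                callback(JSON.parse(GM_getValue('scrapeResults', '[]')));
            }
        }, 500);
    }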

It sounds like your userscript needs to be able to run on arbitrary pages, so you'll probably need // @match *://*/*, as well as a way to indicate to the script that the page that was automatically navigated to is one to scrape.
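As a rough sketch, the metadata block could look something like this (the name is arbitrary, and the grants depend on which of the options above you pick):

    // ==UserScript==
    // @name         Overview scraper
    // @match        *://*/*
    // @grant        GM_setClipboard
    // ==/UserScript==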

When you want to start scraping, open the target page with window.open. (An iframe would be more user-friendly, but it will sometimes fail due to the target site's security restrictions.) When the page opens, your userscript running on the target page can check whether window.opener exists, or whether there's a URL parameter (like scrape=true), to determine that it's a page to be scraped. Scrape the information, then send it back to the parent with .postMessage. The parent can then repeat the process for the other links. (You could even process all links in parallel, if they're on different domains and it won't overload your browser.)
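A sketch of that flow, assuming the links were already collected into an array on page A, that a scrape=true parameter marks the child pages, and that GM_setClipboard is granted; the message shape and extractInfo are placeholders for your own code:

    // On page A: open the links one at a time and collect the replies.
    // (In a real script, guard this block so it only runs on page A.)
    const results = [];
    const queue = [...linksToScrape]; // [B, C, D], extracted from page A

    window.addEventListener('message', (event) => {
        // Optionally verify event.origin against the expected domains here.
        if (event.data && event.data.type === 'scrapeResult') {
            results.push(event.data.payload);
            event.source.close();   // close the child window we opened
            openNext();
        }
    });

    function openNext() {
        if (queue.length === 0) {
            GM_setClipboard(JSON.stringify(results)); // step 6: copy to clipboard
            return;
        }
        const url = new URL(queue.shift());
        url.searchParams.set('scrape', 'true'); // mark it as a page to scrape
        window.open(url.toString());
    }

    openNext();

    // On pages B/C/D (same userscript): detect the flag, scrape, report back.
    if (new URLSearchParams(location.search).get('scrape') === 'true' && window.opener) {
        const payload = extractInfo(); // your own parsing logic
        window.opener.postMessage({ type: 'scrapeResult', payload: payload }, '*');
    }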

Waiting for the DOM to be ready should be trivial. If the page is fully populated at the end of HTML parsing, then all your script needs is to not have @run-at document-start, and it will run once the HTML is loaded. If the page isn't fully populated by then and you need to wait for something else, just poll on a timer until the element you need exists.
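A simple polling helper along those lines could look like this (the selector and timings are placeholders):

    // Poll until the element appears, then hand it to the callback.
    function waitForElement(selector, callback, interval = 200, maxTries = 50) {
        let tries = 0;
        const timer = setInterval(() => {
            const el = document.querySelector(selector);
            if (el) {
                clearInterval(timer);
                callback(el);
            } else if (++tries >= maxTries) {
                clearInterval(timer); // give up after ~10 seconds
            }
        }, interval);
    }

    waitForElement('#detail-table', (el) => {
        // extract and store the information you need from `el` here
    });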

"The pages use some sort of CSRF protection mechanism that will not allow me to use e.g. curl to extract the relevant data."

Rather than a userscript, running this on your own server would be more reliable and somewhat easier to manage, if that's possible. Consider checking whether something more sophisticated than curl could do the job - for example, Puppeteer, which drives a full browser.
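For reference, a rough Puppeteer sketch might look like this (it runs under Node.js rather than as a userscript; the URL and selectors are placeholders):

    const puppeteer = require('puppeteer');

    (async () => {
        const browser = await puppeteer.launch();
        const page = await browser.newPage();

        // Page A: collect the detail links.
        await page.goto('https://example.com/overview');
        const links = await page.$$eval('a.detail-link', as => as.map(a => a.href));

        // Pages B, C, D: visit each link and extract the data.
        const results = [];
        for (const link of links) {
            await page.goto(link, { waitUntil: 'domcontentloaded' });
            results.push(await page.$eval('#detail-table', el => el.textContent));
        }

        console.log(JSON.stringify(results));
        await browser.close();
    })();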
