I'm using Puppeteer in a Node.js script to automatically download a file from a website. To download the file I need to click a button, but the button looks like this and I don't know how to find the URL it requests.

<button type="button" class="v-btn v-btn--fab v-btn--has-bg v-btn--round theme--light elevation-0 v-size--small transparent pa-0" aria-label="Export CSV" title="Export CSV" style="width: 32px; height: 32px;">
  <span class="v-btn__content">
    <i aria-hidden="true" class="v-icon notranslate mdi mdi-download theme--light primary--text">
    </i>
  </span>
</button>

TL;DR: I need to get the link a button points to, but I can't find it in the HTML.

elaachac
  • You can't find the URL, this button uses Javascript – mousetail May 12 '23 at 12:49
  • Isn't there a way to find the URL, though? I know I can access it with Puppeteer, but I only need the URL, not to access it right now. – elaachac May 12 '23 at 12:52
  • No, it's javascript – mousetail May 12 '23 at 12:55
  • Oh well, that's disappointing... Thanks for your help ! – elaachac May 12 '23 at 12:57
  • It's not necessarily impossible to get the URL, but we'd have to see the site. – ggorlen May 12 '23 at 13:48
  • @ggorlen The site is https://data.ademe.fr/datasets/liste-des-entreprises-rge-2 and I need to click the download button to open a modal containing the button that actually downloads the file. – elaachac May 12 '23 at 13:52
  • Thanks. I don't see that button on the site though, `document.querySelectorAll('[aria-label="Export CSV"]')` returns an empty array in dev tools. Which button is it? – ggorlen May 12 '23 at 14:20
  • Yeah, I tried to explain that but it wasn't very clear, sorry. That button is hidden; you first have to click a button with the title "Téléchargement des données". – elaachac May 12 '23 at 14:22
  • I found it, thanks. Looks like it uses JS to trigger a request to https://data.ademe.fr/data-fair/api/v1/datasets/liste-des-entreprises-rge-2/lines?size=10000&page=1&format=csv based on the network tab. Are you happy with this URL as-is, or do you need to get it programmatically? If you need it programmatically, I'd monitor the network request with Puppeteer before clicking the button, `waitForRequest`. – ggorlen May 12 '23 at 14:23
  • Thanks a lot! I need to get it programmatically, since I'll be using it in a script. – elaachac May 12 '23 at 14:29
  • Which other pages on this site have a similar export button, so I can understand the general pattern we're looking for? Or, if this is the only one, which pattern(s) can it be so I know what we're matching on? – ggorlen May 12 '23 at 14:31
  • I don't know, as I'm only interested in this page. The thing is, I need to check every day whether the file has been updated and, if so, download the new file, but for this page only. – elaachac May 12 '23 at 14:32
  • Would the URL change? If not, I'd just paste it into your script as a string. It'll be much faster and probably less flakey than retrieving it dynamically every time you run the script. No browser automation is necessary, just a download request to that URL. – ggorlen May 12 '23 at 14:33
  • Yeah, the URL changes every time you download the file... That's why I'm struggling, haha. But actually, I don't know whether the URL you sent would change... – elaachac May 12 '23 at 14:34
  • Weird, it seems the same for me after trying a few times. Seems like a static API endpoint, nothing randomly generated. What other URL(s) are you seeing? – ggorlen May 12 '23 at 14:37
  • Well, I just tried downloading the file by clicking the button. The URL I get looks like `https://data.ademe.fr/streamsaver/data.ademe.fr/416876/liste-des-entreprises-rge-2.csv`, but the numbers always change. – elaachac May 12 '23 at 14:39
  • That URL gives me "This page could not be found". I think this needs to be specced out a bit better so we're all on the same page. – ggorlen May 12 '23 at 14:40
  • Yeah, it gives me the same. I think it's a temporary URL, generated by clicking the button, for downloading the file. – elaachac May 12 '23 at 14:41
  • We must be clicking on different buttons, because I don't get that URL at all. – ggorlen May 12 '23 at 14:42
  • That's because this URL is the one I got when downloading the file a few hours ago. – elaachac May 12 '23 at 14:42
  • That's so weird, I'm clicking on `````` – elaachac May 12 '23 at 14:45
  • which allows me to click on the other button `````` – elaachac May 12 '23 at 14:45
  • Seems to match my buttons. Maybe it's a regional thing? I'm running this from the West Coast, USA. – ggorlen May 12 '23 at 15:02
  • Damn... I'm in France so yeah maybe. – elaachac May 12 '23 at 15:04

1 Answer

Based on the comments, it sounds like there may be some regional differences, but here's code that produces the same URL I see clicking on it as a user:

const puppeteer = require("puppeteer"); // ^19.7.5

let browser;
(async () => {
  browser = await puppeteer.launch();
  const [page] = await browser.pages();
  const url = "https://data.ademe.fr/datasets/liste-des-entreprises-rge-2";
  await page.goto(url, {waitUntil: "domcontentloaded"});

  // Open the download modal; the "Export CSV" button only exists after this click.
  const btn = await page.waitForSelector('[aria-label="Téléchargement des données"]');
  await btn.click();

  // Start listening for the request before clicking, so it can't be missed.
  const [request] = await Promise.all([
    page.waitForRequest(req =>
      req.url().endsWith(".csv") ||
      req.url().includes("data.ademe.fr/data-fair/api/v1/datasets/liste-des-entreprises-rge-2")
    ),
    (await page.waitForSelector('[aria-label="Export CSV"]')).click()
  ]);

  // The URL the button triggered.
  console.log(request.url());
})()
  .catch(err => console.error(err))
  .finally(() => browser?.close());

The result I see is always https://data.ademe.fr/data-fair/api/v1/datasets/liste-des-entreprises-rge-2/lines?size=10000&page=1&format=csv, at least during the window I ran it in.

You can probably adjust the predicate passed to waitForRequest to match whatever URL pattern you're expecting.

To download the file, pass the captured URL to a download helper, e.g. await downloadFile(request.url(), destPath), where destPath is wherever you want the file saved.
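The downloadFile helper itself doesn't appear above, so here is a minimal sketch of what one might look like. It assumes Node 18+ (for the global fetch) and that the second argument is a local destination path; the name, signature, and the fetchImpl parameter are all hypothetical and not taken from the original answer.

```javascript
const fs = require("node:fs/promises");

// Hypothetical helper (not from the original answer): fetch `url` and write
// the response body to `destPath`. `fetchImpl` defaults to the global fetch
// available in Node 18+; it is a parameter only so it can be swapped out.
async function downloadFile(url, destPath, fetchImpl = fetch) {
  const response = await fetchImpl(url);
  if (!response.ok) {
    throw new Error(`Download failed: ${response.status} for ${url}`);
  }
  // Buffer the whole body in memory, then write it in one go.
  const body = Buffer.from(await response.arrayBuffer());
  await fs.writeFile(destPath, body);
  return body.length; // number of bytes written
}
```

With the script above, this could be called as `await downloadFile(request.url(), "liste-des-entreprises-rge-2.csv")` right after the URL is logged; the file name is just an illustrative choice. As noted in the comments, if the URL turns out to be stable, the same call works with the URL pasted in as a string and no browser automation at all.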

ggorlen
  • Thanks a lot for your help. I just noticed something... I don't know if it's a regional issue, but the file I get from your URL is 4.6 MB, while mine is 90.2 MB. – elaachac May 12 '23 at 15:22
  • I thought you could adjust the `size=10000` parameter to be higher, but the API rejects that. I was hoping when you ran my script it'd plug in the correct URL for you but I guess not. Maybe there's an `offset` parameter you can use to page through all of the records with multiple requests. – ggorlen May 12 '23 at 15:27
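One way to try the "multiple requests" idea from the last comment is sketched below. It relies on unverified assumptions: that the endpoint returns later records when the `page` parameter already in the URL is incremented (no `offset` parameter has been confirmed), that every page repeats the CSV header as its first line, and that an exhausted page comes back with the header only. The function names and the maxPages cap are hypothetical.

```javascript
// Assumption (unverified): the endpoint serves later records when `page` is
// incremented. Build the URL for a given page, leaving other params alone.
function pageUrl(baseUrl, page) {
  const url = new URL(baseUrl);
  url.searchParams.set("page", String(page));
  return url.toString();
}

// Hypothetical loop: request successive pages until one fails or contains
// no data rows, up to `maxPages`. Returns the pages concatenated as one CSV.
async function fetchAllPages(baseUrl, maxPages = 100, fetchImpl = fetch) {
  const chunks = [];
  for (let page = 1; page <= maxPages; page++) {
    const response = await fetchImpl(pageUrl(baseUrl, page));
    if (!response.ok) break;
    let text = await response.text();
    if (!text.endsWith("\n")) text += "\n";
    // Everything after the first line is treated as data rows.
    const rows = text.slice(text.indexOf("\n") + 1);
    if (rows.trim() === "") break; // header only: nothing left to fetch
    // Keep the header from the first page only.
    chunks.push(page === 1 ? text : rows);
  }
  return chunks.join("");
}
```

If the API rejects pages past the first, or pages differently, the loop simply stops early and returns whatever it collected, so compare the resulting size against the 90.2 MB file before relying on it.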