I am using axios and cheerio to try to scrape the total number of holders from this website. The number appears either in the Market Overview section

or the token distribution section under the analysis tab at the bottom.

However, no matter how I traverse the DOM, I am unable to scrape the holders information.

This is how I am trying to scrape from the token distribution section but it returns nothing:

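const axios = require("axios");
const cheerio = require("cheerio");

// `url` is the address of the token page being scraped (defined elsewhere).
async function getHolders() {
  try {
    const { data } = await axios.get(url);
    const $ = cheerio.load(data);
    // Select the holders value in the token distribution section.
    const holders = $("div.ant-typography span.sc-doKvHv.hAxaGu").text();
    console.log(holders);
  } catch (err) {
    console.error(err);
  }
}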
ggorlen
mzaidi
  • I'm surprised you're able to get any information at all with axios and cheerio. It looks like a single-page React app, basically entirely JS-driven, so it needs to be scraped with a browser automation library like Playwright or Puppeteer, or requests may be able to be made directly to the site's API, if it's unsecured (which appears not to be the case). What data are you trying to get from the holders tab exactly? It looks like a lazy-loaded table. Can you share the code you're using to scrape the data you claim to be able to scrape? Also, why is this tagged Java rather than JavaScript? Thanks – ggorlen Mar 15 '23 at 14:52
  • @ggorlen, It absolutely does not need to be scraped with browser automation. How do you think browsers work under the hood? – Conor Reid Mar 15 '23 at 20:18
  • @ConorReid How browsers work under the hood seems irrelevant here, or perhaps I don't understand your hint. How do you propose scraping it? "Need" is used somewhat loosely; I'm sure there are other ways but nothing obvious or feasible I can think of. – ggorlen Mar 15 '23 at 20:20
  • The problem with browser automation is how overkill the solution is. Data is transferred over the internet through HTTP and only HTTP. That means, no matter what the website is, it can be scraped using HTTP alone. Everything else (HTML/JS/CSS) is just for aesthetics. Rendering it through browser automation is just extra work/CPU/RAM and more CO2! I wrote an answer with the best method for this. I see you recommending this on multiple Cheerio posts and it's leading people down the wrong path. – Conor Reid Mar 15 '23 at 20:41
  • +(Data isn't transferred over the internet only through HTTP, but for websites it is) – Conor Reid Mar 15 '23 at 20:47
  • @ConorReid Thanks, but I'm well aware of all of that and even [made a note of it in a blog post](https://serpapi.com/blog/puppeteer-antipatterns/#using-puppeteer-when-other-tools-are-more-appropriate). I have [dozens of answers](https://stackoverflow.com/search?q=user%3A6243352+%5Bpuppeteer%5D+fetch+cheerio) suggesting fetch+cheerio rather than Puppeteer. In my first comment, you might have missed that I mentioned that you can make a request directly to the site's API. – ggorlen Mar 15 '23 at 21:11
  • "That means, no matter what the website is, it can be scraped using HTTP alone" is not correct. Site APIs are often not feasible to scrape from using plain HTTP requests for a variety of reasons. Usually, they're secured and/or require same-site authentication, essentially only serving to a client holding a token on that domain. That's why I default to the more reliable approach, and premature optimization can be harmful and confusing for those unfamiliar with when it is and isn't easy to access an API. Here, I saw a token so I assumed it's not stable. – ggorlen Mar 15 '23 at 21:12
  • Ironically, someone called me out in my [tutorial on doing what you're suggesting](https://stackoverflow.com/a/66878732/6243352), kindly [informing me that it didn't work on their particular site](https://stackoverflow.com/questions/8049520/web-scraping-javascript-page-with-python#comment132088076_66878732), to which I responded that it's not intended to be general-purpose. – ggorlen Mar 15 '23 at 21:19
  • Looking at it again, the token appears to be a reference identifier of some sort for the coin rather than an access token, so OP may be in luck that this is an unsecured API (for now). I haven't tried a request yet, and it'd still be useful to see exactly what data OP wants. – ggorlen Mar 15 '23 at 21:22
  • As a counterexample challenge/bounty to help prove my point: please show me how to programmatically enter a query into https://chat.openai.com/ and get a response on the terminal, without OpenAI's API or a browser automation library. If you can do it with fetch/axios and Cheerio, I'll eat my shoe! (I'll also eat my other shoe if OP shows how they've scraped this React site with fetch/Cheerio without using the API) – ggorlen Mar 15 '23 at 21:37
  • @ggorlen, OpenAI is protected by Cloudflare. In order to make a request, you would either need to solve the JavaScript security challenge or match the browser's TLS fingerprint, depending on OpenAI's Cloudflare settings (or find a vulnerability such as an unprotected origin IP). Most people would use a browser emulator in this case because it often makes it easy to beat low-level security using something like `puppeteer-stealth`, removing the need to reverse the fingerprinting scripts. If you'd like to know more about antibots, firewalls and bypasses, you can follow my Twitter @unreleased – Conor Reid Mar 16 '23 at 03:33

1 Answer

As @ggorlen mentioned, you can't scrape that page's HTML directly because it's a single-page application (SPA). SPAs load their data asynchronously through separate HTTP requests on page load and in response to user actions, which is why the HTML that axios downloads doesn't contain the holders figure. However, I massively disapprove of using browser automation here because it's completely overkill and (in most cases!) causes more problems than it solves.

You can view these XHR requests in your browser's developer tools under the "Network" tab. (I recommend filtering by XHR.)

The great news is that it's even easier than scraping HTML! Instead of parsing the page, you can request the data directly from their JSON API.

https://api.solscan.io/token/meta?token=EPjFWdd5AufqSSqeM2qN1xzybapC8G4wEGGkZwyTDt1v&cluster=

The response from this URL has the holder count under data.holder.
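As a rough sketch of that request, something like the following should work. It assumes Node 18+ (for the built-in `fetch`; `axios.get` would do the same job) and that the endpoint still responds with the `data.holder` field described above. The helper names `extractHolders` and `getHolders` are just illustrative.

```javascript
// Token address and endpoint are taken from the URL above.
const TOKEN = "EPjFWdd5AufqSSqeM2qN1xzybapC8G4wEGGkZwyTDt1v";
const META_URL = "https://api.solscan.io/token/meta";

// Pull the holder count out of an already-parsed response body.
// Assumes the shape { data: { holder: <number> } }.
function extractHolders(body) {
  const holders = body?.data?.holder;
  if (holders === undefined || holders === null) {
    throw new Error("Unexpected response shape: data.holder is missing");
  }
  return Number(holders);
}

// Request the token metadata and return the holder count.
// Uses the global fetch available in Node 18+.
async function getHolders(token = TOKEN) {
  const url = `${META_URL}?token=${encodeURIComponent(token)}&cluster=`;
  const res = await fetch(url, { headers: { accept: "application/json" } });
  if (!res.ok) {
    throw new Error(`Request failed with status ${res.status}`);
  }
  return extractHolders(await res.json());
}

// Usage (performs a network request):
// getHolders().then(console.log).catch(console.error);
```

Keeping the parsing in its own function means it can be checked against a saved response without hitting the API, which is handy if the endpoint changes or starts rejecting requests.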

For the current supply and market cap, you will need to make a request to a different endpoint. That is: https://api.solscan.io/account?address=EPjFWdd5AufqSSqeM2qN1xzybapC8G4wEGGkZwyTDt1v&cluster=

The supply is data.tokenInfo.supply, and the market cap is the supply multiplied by the current price.
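Here is a similar sketch for the supply and market cap, again assuming Node 18+ and the `data.tokenInfo.supply` field mentioned above. The current price is passed in as a parameter because where it comes from isn't covered here, and the supply is used exactly as returned (I haven't confirmed whether it needs scaling by the token's decimals). `extractSupply`, `marketCap`, and `getMarketCap` are illustrative names.

```javascript
const ACCOUNT_URL = "https://api.solscan.io/account";

// Read the supply from an already-parsed account response.
// Assumes the shape { data: { tokenInfo: { supply: <number|string> } } }.
function extractSupply(body) {
  const raw = body?.data?.tokenInfo?.supply;
  const supply = raw === undefined || raw === null ? NaN : Number(raw);
  if (!Number.isFinite(supply)) {
    throw new Error("Unexpected response shape: data.tokenInfo.supply is missing");
  }
  return supply;
}

// Market cap is the supply multiplied by the current price.
function marketCap(supply, price) {
  return supply * price;
}

// Request the account data and compute the market cap for a given price.
// `price` must be supplied by the caller (its source is an open assumption).
async function getMarketCap(address, price) {
  const url = `${ACCOUNT_URL}?address=${encodeURIComponent(address)}&cluster=`;
  const res = await fetch(url, { headers: { accept: "application/json" } });
  if (!res.ok) {
    throw new Error(`Request failed with status ${res.status}`);
  }
  return marketCap(extractSupply(await res.json()), price);
}

// Usage (performs a network request; the price value here is made up):
// getMarketCap("EPjFWdd5AufqSSqeM2qN1xzybapC8G4wEGGkZwyTDt1v", 1.0)
//   .then(console.log)
//   .catch(console.error);
```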

Hope this helps.

Conor Reid