
I'm trying to scrape one table from https://www.nba.com/stats/teams/isolation?PerMode=Totals&TypeGrouping=offensive and I keep getting an empty array. I want to scrape every td inside every tr of the tbody with class Crom_body__UYOcU.

I'm using Node.js and Puppeteer, and here is my code:

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://www.nba.com/stats/teams/isolation?PerMode=Totals&TypeGrouping=offensive');

  const data = await page.evaluate(() => {
    const rows = Array.from(document.querySelectorAll('tbody.Crom_body__UYOcU tr'));

    return rows.map((row) => {
      const tds = Array.from(row.querySelectorAll('td'));
      return tds.map(td => td.textContent.trim());
    });
  });

  console.log(data); // an array of arrays containing the text content of each cell in each row
  await browser.close();
})();

Thanks for the help

Jerry

2 Answers


Scraping HTML using a hashed class name is likely much more brittle than using the site's own API.

You don't need puppeteer to obtain the data in that table — you don't even need to scrape it: it's populated from data that's fetched without authentication from the following address (which I found in the dev tools network panel):

https://stats.nba.com/stats/synergyplaytypes?LeagueID=00&PerMode=Totals&PlayType=Isolation&PlayerOrTeam=T&SeasonType=Regular%20Season&SeasonYear=2022-23&TypeGrouping=offensive

It appears that the server only requires the appropriate Referer header in order to get a response with a JSON body, so you can simply request the data this way:

example.mjs:

const response = await fetch(
  "https://stats.nba.com/stats/synergyplaytypes?LeagueID=00&PerMode=Totals&PlayType=Isolation&PlayerOrTeam=T&SeasonType=Regular%20Season&SeasonYear=2022-23&TypeGrouping=offensive",
  { headers: new Headers([["Referer", "https://www.nba.com/"]]) },
);

const data = await response.json();

// The actual data:
const resultSet = data.resultSets[0];

const teamNameIndex = resultSet.headers.indexOf("TEAM_NAME");
const teamNames = resultSet.rowSet.map((array) => array[teamNameIndex]);

console.log("first five teams:", teamNames.slice(0, 5));

console.log("entire payload:", JSON.stringify(data, null, 2));

In the terminal:

% node --version  
v18.16.0

% node example.mjs
first five teams: [
  'Dallas Mavericks',
  'Philadelphia 76ers',
  'Brooklyn Nets',
  'New York Knicks',
  'Oklahoma City Thunder'
]
entire payload: {
  "resource": "synergyplaytype",
  "parameters": {
    "LeagueID": "00",
    "SeasonYear": "2022-23",
    "SeasonType": "Regular Season",
    "PerMode": "Totals",
    "PlayerOrTeam": "T",
    "PlayType": "Isolation",
    "TypeGrouping": "offensive"
  },
  "resultSets": [
    {
      "name": "SynergyPlayType",
      "headers": [
        "SEASON_ID",
        "TEAM_ID",
        "TEAM_ABBREVIATION",
        "TEAM_NAME",
        "PLAY_TYPE",
        "TYPE_GROUPING",
        "PERCENTILE",
        "GP",
        "POSS_PCT",
        "PPP",
        "FG_PCT",
        "FT_POSS_PCT",
        "TOV_POSS_PCT",
        "SF_POSS_PCT",
        "PLUSONE_POSS_PCT",
        "SCORE_POSS_PCT",
        "EFG_PCT",
        "POSS",
        "PTS",
        "FGM",
        "FGA",
        "FGMX"
      ],
      "rowSet": [
        [
          "22022",
          1610612742,
          "DAL",
          "Dallas Mavericks",
          "Isolation",
          "Offensive",
          0.828,
          82,
          0.122,
          1.018,
          0.423,
          0.169,
          0.07,
          0.15,
          0.031,
          0.466,
          0.485,
          1064,
          1083,
          356,
          842,
          486
        ],
        [
          "22022",
          1610612755,
          "PHI",
          "Philadelphia 76ers",
          "Isolation",
          "Offensive",
          ---snip---
        ],
        ---snip---
      ]
    }
  ]
}
jsejcksn

I see two problems:

  1. The site will block you as a bot if you use the default user agent in headless mode. See "Why does headless need to be false for Puppeteer to work?".
  2. The table data is added dynamically after page load, so you'll need to wait for it to arrive using a call like page.waitForSelector.

Here's the code:

const puppeteer = require("puppeteer"); // ^19.7.2

let browser;
(async () => {
  browser = await puppeteer.launch();
  const [page] = await browser.pages();
  const url =
    "https://www.nba.com/stats/teams/isolation?PerMode=Totals&TypeGrouping=offensive";
  const ua =
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36";
  await page.setUserAgent(ua);
  await page.goto(url, {waitUntil: "domcontentloaded"});
  const bodySel = ".nba-stats-content-block tr";
  await page.waitForSelector(bodySel);
  const data = await page.$$eval(bodySel, rows =>
    rows.map(row =>
      [...row.querySelectorAll("td")]
        .map(cell => cell.textContent.trim())
    )
  );
  console.log(data);
})()
  .catch(err => console.error(err))
  .finally(() => browser?.close());

Note that I'm using a selector that seems more stable and doesn't have random numbers at the end. If there's only going to be one table, plain "tr" and "td" may be even better.
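
For example, here's a minimal sketch of that variant with generic selectors (it assumes the page really does render just one stats table, so plain table/tbody/td selectors won't pick up anything else; otherwise it's the same $$eval approach as above):

// ...
  // Generic selectors: assumes the page contains only the one stats table.
  const rowSel = "table tbody tr";
  await page.waitForSelector(rowSel);
  const data = await page.$$eval(rowSel, rows =>
    rows.map(row =>
      [...row.querySelectorAll("td")].map(cell => cell.textContent.trim())
    )
  );
// ...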

If you want headers on your rows, try adding this after const data = ...:

// ...
  const headers = await page.$$eval(
    ".nba-stats-content-block th",
    els => els.map(e => e.textContent.trim())
  );
  const result = data.map(row =>
    Object.fromEntries(headers.map((x, i) => [x, row[i]]))
  );
  console.log(result);
// ...

To block unnecessary requests to boost the speed a bit, you can use:

// ...
  await page.setRequestInterception(true);
  const allowed = [
    url,
    "https://www.nba.com/_next",
    "https://stats.nba.com/",
  ];
  page.on("request", request => {
    if (allowed.some(e => request.url().startsWith(e))) {
      request.continue();
    }
    else {
      request.abort();
    }
  });
  await page.goto(url, {waitUntil: "domcontentloaded"});
// ...

If you examine the responses, you can see an API call that contains the table data:

https://stats.nba.com/stats/synergyplaytypes?LeagueID=00&PerMode=Totals&PlayType=Isolation&PlayerOrTeam=T&SeasonType=Regular%20Season&SeasonYear=2022-23&TypeGrouping=offensive

Instead of scraping the data from the DOM, you can intercept that response:

// ...
  const responseP = page.waitForResponse(response =>
    response
      .request()
      .url()
      .startsWith("https://stats.nba.com/")
  );
  await page.goto(url, {waitUntil: "domcontentloaded"});
  const response = await responseP;
  const data = await response.json();
  console.log(data.resultSets[0].rowSet);
// ...

You can optionally use the same "zip" logic above to map the headers to the rows.
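
For example, here's a minimal sketch of that zip applied to the intercepted payload, assuming the first result set (data.resultSets[0], as in the snippet above) is the table you want:

// ...
  // Zip the column headers with each row to get an array of objects.
  const {headers, rowSet} = data.resultSets[0];
  const result = rowSet.map(row =>
    Object.fromEntries(headers.map((h, i) => [h, row[i]]))
  );
  console.log(result[0]); // e.g. {SEASON_ID: "22022", ..., TEAM_NAME: "Dallas Mavericks", ...}
// ...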

Now, it turns out all of this is just for instructive purposes. In this case, that API URL is unprotected, so we can hit it directly with a plain HTTP request, no Puppeteer needed, as this answer nicely illustrates.

See this tutorial for how to find unprotected endpoints like this. Note that it's not always possible to access them, so you'll often need to fall back on the strategies in this post.

ggorlen