
I am attempting to scrape tables from spotrac.com and save the data to a pandas dataframe. For whatever reason, if the table I am scraping has over 100 rows, the BeautifulSoup object only grabs the first 100 rows of the table. If you run my code below, you'll see that the resulting dataframe has only 100 rows and ends with "David Montgomery." If you visit the webpage (https://www.spotrac.com/nfl/rankings/2019/base/running-back/) and Ctrl+F "David Montgomery", you'll see that there are additional rows. If you change the URL in the `s.get(...)` line to "https://www.spotrac.com/nfl/rankings/2019/base/wide-receiver/", the same thing happens: only the first 100 rows end up in the BeautifulSoup object and in the dataframe.

import pandas as pd
import requests
from bs4 import BeautifulSoup

# Begin requests session
with requests.session() as s:
    # Get page
    r = s.get('https://www.spotrac.com/nfl/rankings/2019/base/running-back/')

    # Get page content, find first table, and save to df
    soup = BeautifulSoup(r.content, 'lxml')
    table = soup.find_all('table')[0]
    df_list = pd.read_html(str(table))
    df = df_list[0]

I have read that changing the parser can help. I have tried using different parsers by replacing the BeautifulSoup object code with the following:

soup = BeautifulSoup(r.content,'html5lib')
soup = BeautifulSoup(r.content,'html.parser')

Neither of these changes worked. I have run "pip install html5lib" and "pip install lxml" and confirmed that both were already installed.
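
As a sanity check, the parser really isn't the culprit: a parser can only see rows that are present in the HTML the server sent. A minimal sketch (using a tiny inline table rather than the live page) showing that every installed parser reports the same row count:

```python
from bs4 import BeautifulSoup

# A tiny stand-in table; the live page's HTML behaves the same way.
html = "<table><tr><td>A</td></tr><tr><td>B</td></tr></table>"

# Only compare parsers that are actually installed.
parsers = ["html.parser"]
for optional in ("lxml", "html5lib"):
    try:
        BeautifulSoup("", optional)
        parsers.append(optional)
    except Exception:
        pass

# Every parser finds the same rows: if rows are missing, they were
# never in the response, so switching parsers cannot recover them.
counts = {p: len(BeautifulSoup(html, p).find_all("tr")) for p in parsers}
print(counts)  # each installed parser reports 2 rows
```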

dwismer
  • I think it's probably because some of the entries are dynamically loaded using JavaScript. For that you would need a different package, such as `dryscrape` mentioned here: https://stackoverflow.com/questions/8049520/web-scraping-javascript-page-with-python – zmike Jul 01 '20 at 02:48
  • When I turn off JavaScript, this page displays only 100 rows. `requests` and `BeautifulSoup` can't run JavaScript, so you can't get more rows that way. You will have to use [Selenium](https://selenium-python.readthedocs.io/) to control a real web browser, which can run JavaScript. Or you can use `DevTools` in Firefox/Chrome to find the URL JavaScript uses to get more rows, and then use that URL with `requests`. – furas Jul 01 '20 at 03:49

2 Answers


This page uses JavaScript to load extra data.

In DevTools in Firefox/Chrome you can see that the page sends a POST request with the extra form data {'ajax': True, 'mobile': False}:

import pandas as pd
import requests
from bs4 import BeautifulSoup

with requests.session() as s:

    # POST with the extra form data instead of a plain GET
    r = s.post('https://www.spotrac.com/nfl/rankings/2019/base/running-back/',
               data={'ajax': True, 'mobile': False})

    # Get page content, find first table, and save to df
    soup = BeautifulSoup(r.content, 'lxml')
    table = soup.find_all('table')[0]
    df_list = pd.read_html(str(table))
    df = df_list[0]
    print(df)
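
One detail worth noting: `requests` form-encodes the dict, so the Python booleans go over the wire as the strings "True" and "False", which this endpoint appears to accept. A quick way to inspect the encoded body without sending anything:

```python
import requests

# Build and prepare the request without sending it,
# then look at the form-encoded body.
req = requests.Request(
    "POST",
    "https://www.spotrac.com/nfl/rankings/2019/base/running-back/",
    data={"ajax": True, "mobile": False},
)
prepared = req.prepare()
print(prepared.body)  # ajax=True&mobile=False
```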
    
furas

I suggest you use requests-html:

import pandas as pd
from bs4 import BeautifulSoup
from requests_html import HTMLSession


if __name__ == "__main__":
    # Begin requests session
    s = HTMLSession()
    # Get page
    r = s.get('https://www.spotrac.com/nfl/rankings/2019/base/running-back/')
    r.html.render()
    # Get page content, find first table, and save to df
    soup = BeautifulSoup(r.html.html, 'lxml')
    table = soup.find_all('table')[0]
    df_list = pd.read_html(str(table))
    df = df_list[0]

Then you will get all 140 rows.

Xu Qiushi