pandas.read_html only getting header of HTML table

Question

I'm using pandas.read_html to get a table from a website. Instead of the entire table, it returns only the header row. How can I fix this?

Code:

import pandas as pd

term_codes = {"fall":"10", "spring":"20", "summer":"30"}

# year must be the ending year of the school year: for 2021-2022, use 2022
year = "2022"
department = "CSCI"
term_code = year + term_codes["fall"]
url = "https://courselist.wm.edu/courselist/courseinfo/searchresults?term_code=" + term_code + "&term_subj=" + department + "&attr=0&attr2=0&levl=0&status=0&ptrm=0&search=Search"

def findCourseTable():
    dfs = pd.read_html(url)
    print(dfs[0])
    #df = dfs[1]
    #df.to_csv(r'courses.csv', index=False)

if __name__ == "__main__":
    findCourseTable()

Output:

Empty DataFrame
Columns: [CRN, COURSE ID, CRSE ATTR, TITLE, INSTRUCTOR, CRDT HRS, MEET DAY:TIME, PROJ ENR, CURR ENR, SEATS AVAIL, STATUS]
Index: []

score 3 · Accepted Answer · answered Aug 26 '21 at 18:27

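The page contains malformed HTML. The lxml parser, which pd.read_html tries first by default, recovers only the header row from it. Pass flavor="html5lib" to pd.read_html to use the more lenient html5lib parser (this requires the html5lib and beautifulsoup4 packages), which reads the table correctly: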

import pandas as pd

term_codes = {"fall": "10", "spring": "20", "summer": "30"}

# year must be the ending year of the school year: for 2021-2022, use 2022
year = "2022"
department = "CSCI"
term_code = year + term_codes["fall"]
url = (
    "https://courselist.wm.edu/courselist/courseinfo/searchresults?term_code="
    + term_code
    + "&term_subj="
    + department
    + "&attr=0&attr2=0&levl=0&status=0&ptrm=0&search=Search"
)

df = pd.read_html(url, flavor="html5lib")[0]
print(df)

Prints:

      CRN     COURSE ID  CRSE ATTR                           TITLE                        INSTRUCTOR CRDT HRS  MEET DAY:TIME  PROJ ENR  CURR ENR SEATS AVAIL  STATUS
0   16064   CSCI 100 01  C100, NEW                  Reading@Russia  Willner, Dana; Prokhorova, Elena        4  MWF:1300-1350        10        10          0*  CLOSED
1   14614   CSCI 120 01        NaN  A Career in CS? And Which One?                     Kemper, Peter        1    M:1700-1750        36        20          16    OPEN
2   16325   CSCI 120 02        NEW    Concepts in Computer Science                   Deverick, James        3   TR:0800-0920        36        25          11    OPEN
3   12372   CSCI 140 01   NEW, NQR    Programming for Data Science                 Khargonkar, Arohi        4  MWF:0900-0950        36        24          12    OPEN
4   14620   CSCI 140 02   NEW, NQR    Programming for Data Science                 Khargonkar, Arohi        4  MWF:1100-1150        36        27           9    OPEN
5   13553   CSCI 140 03   NEW, NQR    Programming for Data Science                 Khargonkar, Arohi        4  MWF:1300-1350        36        25          11    OPEN

...and so on.
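
If it isn't known in advance which parser copes with a given page, one option is to try several flavors and keep whichever returns the most rows. The following is only a sketch under a few assumptions: read_largest_table is a hypothetical helper (not part of pandas), it looks only at the first table each parser finds, and the inline sample HTML is made up for illustration and is well-formed, so it does not reproduce the problem on the real page.

```python
from io import StringIO

import pandas as pd


def read_largest_table(html, flavors=("lxml", "html5lib")):
    """Hypothetical helper: parse `html` with each flavor and return
    (flavor, DataFrame) for the parse whose first table has the most rows."""
    best = None
    for flavor in flavors:
        try:
            tables = pd.read_html(StringIO(html), flavor=flavor)
        except (ImportError, ValueError):
            # ImportError: the parser library for this flavor is not installed.
            # ValueError: this parser found no tables in the document.
            continue
        df = tables[0]
        # Keep the first result, or any later one that has more rows.
        if best is None or len(df) > len(best[1]):
            best = (flavor, df)
    if best is None:
        raise ValueError("no parser could extract a table")
    return best


# Made-up sample markup, used only to show the call.
sample = """
<table>
  <thead><tr><th>CRN</th><th>TITLE</th></tr></thead>
  <tbody>
    <tr><td>16064</td><td>Reading@Russia</td></tr>
    <tr><td>14614</td><td>A Career in CS? And Which One?</td></tr>
  </tbody>
</table>
"""

flavor, df = read_largest_table(sample)
print(flavor)
print(df)
```

For the course list page, the HTML would first have to be downloaded as text (for example with requests) and then passed to the helper in place of sample.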
