Create a Dataframe from HTML

Question

I am trying to read a table from a web page. My company has strict authentication policies that restrict how we can scrape data, but the following code is what I am using to fetch the page:

import os

import pandas as pd
import requests
from requests_kerberos import HTTPKerberosAuth, OPTIONAL

# Raw string, so single backslashes are enough for the Windows path.
cert = r"C:\Users\name\Desktop\cacert.pem"
os.environ["REQUESTS_CA_BUNDLE"] = cert
kerberos = HTTPKerberosAuth(mutual_authentication=OPTIONAL)
session = requests.Session()

link = 'weblink'  # placeholder; the real URL is private
# Note: verify=False disables certificate verification for this request,
# so the CA bundle configured above is not used here.
data = session.get(link, auth=kerberos, verify=False).content.decode("latin-1")

That leaves me with the entire HTML of the web page in `data`. How do I convert this into a dataframe?

Note: I couldn't provide the web link due to privacy concerns. I was just wondering if there is a general way to tackle this situation.

I was just wondering if there was a procedure to convert the HTML into a dataframe. That's what the question is about — jack ryan, Oct 21 '19 at 04:06
[`pandas.read_html`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_html.html) if there are tables, they can be read directly into pandas. — Trenton McKinney, Oct 21 '19 at 05:06

caxcaxcoatl · Accepted Answer · 2019-10-21T05:10:21.733

It looks like you're looking for a way to parse the HTML table with BeautifulSoup?

From there, you'll have to create the dataframe itself, but you will already have covered the 'procedure to convert the HTML into' a data structure step: read the HTML table into a list or dictionary, and then transform that into a dataframe.
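As a rough sketch of that two-step route (not necessarily the exact code the answer had in mind): parse the first `<table>` with BeautifulSoup into a list of rows, then hand that list to pandas. The sample HTML below is a made-up stand-in for the `data` string from the question, and the assumption that the first row holds the column names may not match the real page.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hypothetical stand-in for the HTML string held in `data` in the question.
data = """
<html><body>
<table>
  <tr><th>name</th><th>score</th></tr>
  <tr><td>alice</td><td>10</td></tr>
  <tr><td>bob</td><td>7</td></tr>
</table>
</body></html>
"""

# Step 1: read the HTML table into a plain list of rows.
soup = BeautifulSoup(data, "html.parser")
table = soup.find("table")  # assumption: the first table is the one we want

rows = []
for tr in table.find_all("tr"):
    cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
    if cells:
        rows.append(cells)

# Step 2: turn the list into a dataframe.
# Assumption: the first row contains the column names.
header, *body = rows
df = pd.DataFrame(body, columns=header)
print(df)
```

Every value ends up as a string with this approach, so numeric columns would still need something like `pd.to_numeric` afterwards.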

Edit 1

Actually, you can use pandas' read_html. You might still need BeautifulSoup to get exactly what you want, but depending on what the source HTML looks like, read_html alone might be enough.
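A minimal sketch of the read_html route, again with made-up sample HTML standing in for the `data` string from the question. `pandas.read_html` returns a list with one dataframe per table it finds, and it needs an HTML parser installed (lxml, or BeautifulSoup together with html5lib). Wrapping the string in `StringIO` avoids the deprecation warning that recent pandas versions raise for literal HTML strings.

```python
from io import StringIO

import pandas as pd

# Hypothetical stand-in for the HTML string held in `data` in the question.
data = """
<html><body>
<table>
  <tr><th>name</th><th>score</th></tr>
  <tr><td>alice</td><td>10</td></tr>
  <tr><td>bob</td><td>7</td></tr>
</table>
</body></html>
"""

# read_html returns a list of dataframes, one per <table> in the document.
tables = pd.read_html(StringIO(data))
df = tables[0]  # assumption: the first table is the one we want
print(df)
```

If the page contains several tables, the `match` or `attrs` parameters of `read_html` can narrow down which ones are returned.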
