
I want to download financial data ("konsernregnskap" not "morregnskap") from the following website, but I am not sure how to get all content downloaded: https://www.proff.no/regnskap/yara-international-asa/oslo/hovedkontortjenester/IGB6AV410NZ/

I tried to locate the tables with XPath, but I have been unsuccessful.

I want to download all of the content into one Excel sheet.

MJ O
2 Answers


The answer given by @rusu_ro1 is correct. However, I think that pandas is the right tool for the job here.

You can use pandas.read_html to get all the tables on the page, then use pandas.DataFrame.to_excel to write only the last four tables to an Excel workbook.

The following script scrapes the data and writes each table to a different sheet.

import pandas as pd

all_tables = pd.read_html(
    "https://www.proff.no/regnskap/yara-international-asa/oslo/hovedkontortjenester/IGB6AV410NZ/"
)
with pd.ExcelWriter('output.xlsx') as writer:
    # The last 4 tables hold the 'konsernregnskap' data
    for idx, df in enumerate(all_tables[4:8]):
        # Drop the last column (it is empty)
        df = df.drop(df.columns[-1], axis=1)
        df.to_excel(writer, sheet_name='Table {}'.format(idx))
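If you literally want everything in a single sheet rather than one sheet per table, you can stack the frames with to_excel's startrow parameter. A minimal sketch, with dummy frames standing in for the scraped tables (the filename and spacing are just illustrative, and the default .xlsx engine requires openpyxl):

```python
import pandas as pd

# Dummy frames standing in for the four scraped 'konsernregnskap' tables
frames = [pd.DataFrame({'A': [i], 'B': [i * 2]}) for i in range(4)]

with pd.ExcelWriter('single_sheet.xlsx') as writer:
    row = 0
    for df in frames:
        # Write each table further down the same sheet
        df.to_excel(writer, sheet_name='konsernregnskap', startrow=row)
        row += len(df) + 2  # leave a blank row between tables
```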

Notes:

flavor : str or None, container of strings

The parsing engine to use. ‘bs4’ and ‘html5lib’ are synonymous with each other, they are both there for backwards compatibility. The default of None tries to use lxml to parse and if that fails it falls back on bs4 + html5lib.

From HTML Table Parsing Gotchas

html5lib generates valid HTML5 markup from invalid markup automatically. This is extremely important for parsing HTML tables, since it guarantees a valid document. However, that does NOT mean that it is “correct”, since the process of fixing markup does not have a single definition.

In your specific case it drops the 5th table (read_html returns only 7 tables). Perhaps that is because the 1st and 5th tables contain the same data.
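The flavor difference is easy to see offline. A minimal sketch using a made-up HTML snippet (not the proff.no markup); note that the 'bs4' flavor requires both beautifulsoup4 and html5lib to be installed:

```python
from io import StringIO

import pandas as pd

# Hypothetical table, just to demonstrate forcing the bs4 parser
html = """
<table>
  <tr><th>Year</th><th>Revenue</th></tr>
  <tr><td>2018</td><td>100</td></tr>
</table>
"""
# Force bs4 + html5lib instead of the default lxml parser
tables = pd.read_html(StringIO(html), flavor="bs4")
print(tables[0])
```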

Bitto

There are 8 tables within the class table-wrap. The first 4 tables belong to the "morregnskap" tab and the next 4 belong to the "konsernregnskap" tab, so by choosing the last 4 you get your desired tables, from which you can start to scrape your data:

import requests
import bs4

url = 'https://www.proff.no/regnskap/yara-international-asa/oslo/hovedkontortjenester/IGB6AV410NZ/'

response = requests.get(url)
soup = bs4.BeautifulSoup(response.text, 'lxml')
tables = soup.find_all('div', class_='table-wrap')

# the last 4 of the 8 tables belong to the 'konsernregnskap' tab
konsernregnskap_data = tables[4:]
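From there, one way to pull the cell text out of each matched table is to walk its rows with BeautifulSoup. A sketch on a made-up table-wrap snippet, since the real page's cell layout may differ:

```python
import bs4

# Hypothetical markup mimicking one table-wrap div from the page
html = """
<div class="table-wrap"><table>
  <tr><th>Post</th><th>2018</th></tr>
  <tr><td>Driftsinntekter</td><td>100</td></tr>
</table></div>
"""
soup = bs4.BeautifulSoup(html, 'lxml')
table = soup.find('div', class_='table-wrap').find('table')

# One list of cell strings per table row
rows = [[cell.get_text(strip=True) for cell in tr.find_all(['th', 'td'])]
        for tr in table.find_all('tr')]
print(rows)  # [['Post', '2018'], ['Driftsinntekter', '100']]
```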
kederrac