
I'm trying to write a short Python snippet that loops through different webpages structured in the same way (i.e. the same number of columns/rows), loads all the information into a pandas DataFrame, and finally exports it to Excel.

I managed to write the code that gathers what should be the column headers (in the dt HTML tag) and the rows (in the dd HTML tag), but I'm having trouble placing all this info into a pandas DataFrame.

    import requests
    from bs4 import BeautifulSoup
    import pandas as pd

    boolInitialiseTable = True  # True until the first page has set the column headers

    for row in rows:
        QA_link = row.find('td', class_='views-field views-field-nothing-1').find('a', href=True)['href']
        req_QA = requests.get(QA_link)
        soup_QA = BeautifulSoup(req_QA.text, 'html.parser')
        QA_table = soup_QA.find('dl', class_='dl-horizontal SingleRulebookRecord')
        if boolInitialiseTable:
            # headers come from the <dt> tags, with the trailing ':' removed
            QA_hdr = [str.replace(link.string, ':', '') for link in QA_table.findAll('dt')]
            # values come from the <dd> tags
            QA_details = [str(link.string) for link in QA_table.findAll('dd')]
            df = pd.DataFrame()
            df = pd.concat([df, pd.DataFrame(QA_details).transpose()], ignore_index=True, axis=0)
            boolInitialiseTable = False
            df.columns = QA_hdr
        else:
            QA_details = [str(link.string) for link in QA_table.findAll('dd')]
            df = pd.concat([df, pd.DataFrame(QA_details).transpose()])

Here rows contains all the different web pages that need to be accessed to gather the info I need to put in the pandas DataFrame.
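For context, a minimal sketch of how such a rows collection could be obtained (the listing URL and the table markup below are illustrative assumptions, not my exact code):

    import requests
    from bs4 import BeautifulSoup

    # Illustrative only: the listing page URL and table structure are assumptions
    listing_url = 'https://www.eba.europa.eu/single-rule-book-qa'
    soup = BeautifulSoup(requests.get(listing_url).text, 'html.parser')
    rows = soup.find('table').find_all('tr')  # one <tr> per Q&A record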

So from the HTML table like content of:

<dl class="dl-horizontal SingleRulebookRecord">
       <dt>Question ID:</dt>
       <dd>2020_5469 </dd>
       <dt>Topic:</dt>
       <dd>Weather</dd>
       <dt>Date</dt>
       <dd>06/06/2020</dd>
</dl>

I would like to get a pandas DataFrame with:

Question ID  Topic    Date
2020_5469    Weather  06/06/2020

Finally, df.to_excel('results.xlsx') should do the job of exporting everything to Excel.

I feel that all this transposing in the code is not the correct way of doing it. In addition, the dtype of the fields of the table is object and not string as I would expect, but maybe this is not a problem.

Mario

1 Answer

I would do it like this:

import requests
from bs4 import BeautifulSoup
import pandas as pd

links = ['https://www.eba.europa.eu/single-rule-book-qa/qna/view/publicId/2020_5469',
         'https://www.eba.europa.eu/single-rule-book-qa/qna/view/publicId/2020_5128']

dfs = []  # one single-row DataFrame per page, concatenated at the end

for QA_link in links:
    req_QA = requests.get(QA_link)
    soup_QA = BeautifulSoup(req_QA.text, 'html.parser')
    
    # <dt> tags hold the field names; <dd> tags hold the matching values
    QA_hdr = [link.get_text() for link in soup_QA.findAll('dt')]
    QA_details = [[link.get_text() for link in soup_QA.findAll('dd')]]  # nested list: one inner list per row

    dfs.append(pd.DataFrame(QA_details, columns=QA_hdr))

df_all = pd.concat(dfs, axis=0).reset_index(drop=True)

# check for NaN values (columns not shared between urls)
print(df_all[df_all.columns[df_all.isna().any()]].T)

                                                                 0    1
Name of institution / submitter:       BearingPoint Switzerland AG  NaN
Country of incorporation / residence:                  Switzerland  NaN
Answer prepared by:                    Answer prepared by the EBA.  NaN
Subparagraph:                                                  NaN  (f)

df_all.iloc[:,:5].info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   Question ID:  2 non-null      object
 1   Legal Act:    2 non-null      object
 2   Topic:        2 non-null      object
 3   Article:      2 non-null      object
 4   Paragraph:    2 non-null      object
dtypes: object(5)
memory usage: 208.0+ bytes

Notice that QA_details is a nested list: each inner list fills a new row; it's just that here you only have one. For example, here's how it works with two inner lists:

lst = [[1,2],[3,4]]
df = pd.DataFrame(lst, columns=['A','B'])

print(df)

   A  B
0  1  2
1  3  4
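As an aside: since the <dt>/<dd> tags come in pairs, you could also zip them into a dict and build each row directly, avoiding the transpose altogether. A sketch, assuming every <dd> directly follows its <dt> (using the HTML snippet from the question):

from bs4 import BeautifulSoup
import pandas as pd

html = '''<dl class="dl-horizontal SingleRulebookRecord">
       <dt>Question ID:</dt>
       <dd>2020_5469 </dd>
       <dt>Topic:</dt>
       <dd>Weather</dd>
       <dt>Date</dt>
       <dd>06/06/2020</dd>
</dl>'''

dl = BeautifulSoup(html, 'html.parser').find('dl')
# pair each <dt> header with the <dd> value that follows it
record = {dt.get_text(strip=True).rstrip(':'): dd.get_text(strip=True)
          for dt, dd in zip(dl.find_all('dt'), dl.find_all('dd'))}
print(pd.DataFrame([record]))
#   Question ID    Topic        Date
# 0   2020_5469  Weather  06/06/2020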

As for why the dtype is given as object, see e.g. this SO post. But all your cells will in fact contain strings, which we can easily check:

cols = df_all.columns[df_all.notna().all()]
print(all([isinstance(i, str) for i in df_all.loc[0, cols]]))
# True
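If you'd rather see a dedicated string dtype than object, convert_dtypes() (available since pandas 1.0) will infer it; a small sketch:

# infer better dtypes; object columns holding strings become the 'string' dtype
df_str = df_all.convert_dtypes()
print(df_str.dtypes.iloc[0])  # 'string' (or 'string[python]' in newer pandas)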

Finally, yes, df.to_excel('results.xlsx') will work to export the df to Excel. Perhaps use df.to_excel('results.xlsx', index=False) to avoid exporting the index.
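For reference, a minimal export plus read-back check (this assumes an Excel writer engine such as openpyxl is installed, which to_excel needs for .xlsx files):

df_all.to_excel('results.xlsx', index=False)  # index=False drops the 0..n row index
check = pd.read_excel('results.xlsx')
print(check.shape)  # (2, n_columns) for the two example URLs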

ouroboros1
  • That works perfectly fine for the first URL. When I loop to the second URL, how can I add the new QA_details into the df DataFrame? – Mario Sep 14 '22 at 16:50
  • Depends a bit. Are you certain that *all* URLs will have the same "columns", or could there also be pages that have different ones? – ouroboros1 Sep 14 '22 at 16:54
  • Just discovered that the URLs might have slightly different columns, e.g. here for "Subparagraph:" and the missing "Name of institution / submitter:" https://www.eba.europa.eu/single-rule-book-qa/qna/view/publicId/2020_5128 – Mario Sep 14 '22 at 16:59
  • Right, updated the answer. With different columns it is probably safest just to create individual dfs and then use `pd.concat` at the end. – ouroboros1 Sep 14 '22 at 17:01
  • Forgot this, but you will want to reset the index here: `df_all = pd.concat(dfs, axis=0).reset_index(drop=True)` – ouroboros1 Sep 14 '22 at 17:08