I'm trying to write a short Python snippet that loops through different webpages structured in the same way (i.e. same number of columns/rows), loads all the information into a pandas DataFrame, and finally exports it to Excel.
I managed to write the code that gathers what should be the column headers (from the dt HTML tags) and the rows (from the dd HTML tags), but I'm having issues placing all this info into a pandas DataFrame.
import requests
import pandas as pd
from bs4 import BeautifulSoup

for row in rows:
    # Follow the link to the detail page for this row
    QA_link = row.find('td', class_='views-field views-field-nothing-1').find('a', href=True)['href']
    req_QA = requests.get(QA_link)
    soup_QA = BeautifulSoup(req_QA.text, 'html.parser')
    QA_table = soup_QA.find('dl', class_='dl-horizontal SingleRulebookRecord')
    if boolInitialiseTable:
        # First page: take the <dt> tags as column headers (dropping the trailing ':')
        QA_hdr = [link.string.replace(':', '') for link in QA_table.find_all('dt')]
        QA_details = [str(link.string) for link in QA_table.find_all('dd')]
        df = pd.DataFrame(QA_details).transpose()
        df.columns = QA_hdr
        boolInitialiseTable = False
    else:
        # Subsequent pages: append the <dd> values as a new row
        QA_details = [str(link.string) for link in QA_table.find_all('dd')]
        new_row = pd.DataFrame(QA_details).transpose()
        new_row.columns = QA_hdr  # align columns so concat stacks rows instead of adding new columns
        df = pd.concat([df, new_row], ignore_index=True)
Here rows contains all the different web pages that need to be accessed to gather the info I need to put in the pandas DataFrame.
So from the HTML table like content of:
<dl class="dl-horizontal SingleRulebookRecord">
<dt>Question ID:</dt>
<dd>2020_5469 </dd>
<dt>Topic:</dt>
<dd>Weather</dd>
<dt>Date</dt>
<dd>06/06/2020</dd>
</dl>
I would like to get a pandas dataframe with:
Question ID | Topic | Date |
---|---|---|
2020_5469 | Weather | 06/06/2020 |
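For reference, the mapping from one of these <dl> blocks to a single DataFrame row can be sketched as follows. This is a minimal, self-contained example that just parses the sample record above (the html variable is the snippet, not a real page), pairing each <dt> header with its <dd> value:

```python
import pandas as pd
from bs4 import BeautifulSoup

html = """
<dl class="dl-horizontal SingleRulebookRecord">
<dt>Question ID:</dt>
<dd>2020_5469 </dd>
<dt>Topic:</dt>
<dd>Weather</dd>
<dt>Date</dt>
<dd>06/06/2020</dd>
</dl>
"""

soup = BeautifulSoup(html, 'html.parser')
dl = soup.find('dl', class_='dl-horizontal SingleRulebookRecord')

# Pair each <dt> header with its <dd> value; strip the trailing ':' and whitespace
record = {
    dt.get_text().replace(':', '').strip(): dd.get_text().strip()
    for dt, dd in zip(dl.find_all('dt'), dl.find_all('dd'))
}

# One dict per page -> one row per page
df = pd.DataFrame([record])
```

Because zip pairs the tags positionally, this assumes every page has the same number of <dt> and <dd> tags in matching order, which holds here since the pages share the same structure.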
Finally, df.to_excel('results.xlsx') should do the job of exporting everything to Excel.
I feel that all this transposing in the code is not the correct way of doing it. In addition, the dtype of the table's fields is object and not string as I would expect, but maybe that's not a problem.
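One common alternative that avoids both the transposes and the two-branch initialisation is to collect one dict per page in a list and build the DataFrame once at the end. A minimal sketch, with hardcoded dicts standing in for the scraped <dt>/<dd> pairs (the second record is invented purely for illustration):

```python
import pandas as pd

# Each dict stands in for one scraped page; in the real loop it would be
# built by zipping the <dt> headers with the <dd> values
records = [
    {'Question ID': '2020_5469', 'Topic': 'Weather', 'Date': '06/06/2020'},
    {'Question ID': '2020_5470', 'Topic': 'Climate', 'Date': '07/06/2020'},
]

# Single DataFrame construction: no transpose, no initialisation flag
df = pd.DataFrame(records)

# df.to_excel('results.xlsx', index=False)  # requires openpyxl to be installed
```

As for the dtype: pandas stores Python strings in object-dtype columns by default, so seeing object rather than string is expected and does not affect to_excel.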