I've got a page I'm scraping and most of the tables are in the format Heading --info. I can iterate through most of the tables and create separate dataframes for all the various information using pandas.read_html.
However, there are some where they've combined information into one table with subheadings that I want to be separate dataframes with the text of that row as the heading (appending a list).
Is there an easy way to split this dataframe - It will always be heading followed by associated rows, new heading followed by new associated rows.
eg.
Col1 Col2
0 thing
1 1 2
2 2 3
3 thing2
4 1 2
5 2 3
6 3 4
Should be
thing
1 1 1
2 2 2
thing2
4 1 2
5 2 3
6 3 4
It'd be nice if people would just create web pages that made sense with the data but that's not the case here.
I've tried iterrows but cannot seem to come up with a good way to create what I want.
Help would be very much appreciated!
<div class="ranking">
<h6><a href="javascript:;">Sprint</a></h6>
<table>
<tbody>
</tbody>
<tbody>
<tr>
<td class="title" colspan="8">Canneto - km 137</td>
</tr>
<tr>
<td class="rank"><span title="Rank">1</span></td>
<td class="any-progression"></td>
<td class="bib"><span title="Bib">21</span></td>
<td class="name">
<a class="10010085859" href="javascript:;">
<abbr title="Young rider">*</abbr>
BAGIOLI Nicola
</a>
</td>
<td class="team"><img alt="ANDRONI GIOCATTOLI - SIDERMEC" src="/Content/images/event/2020/tirreno-adriatico/jerseys/ANS.png" title="ANDRONI GIOCATTOLI - SIDERMEC"/></td>
<td class="noc"><img alt="ITA" src="/Content/images/flags/ITA.png" title="ITA"/></td>
<td class="bonif">
</td>
<td class="points" title="Points">5</td>
</tr>
<tr>
<td class="rank"><span title="Rank">2</span></td>
<td class="any-progression"></td>
<td class="bib"><span title="Bib">54</span></td>
<td class="name">
<a class="10008688453" href="javascript:;">
ORSINI Umberto
</a>
</td>
<td class="team"><img alt="BARDIANI CSF FAIZANE'" src="/Content/images/event/2020/tirreno-adriatico/jerseys/BCF.png" title="BARDIANI CSF FAIZANE'"/></td>
<td class="noc"><img alt="ITA" src="/Content/images/flags/ITA.png" title="ITA"/></td>
<td class="bonif">
</td>
<td class="points" title="Points">3</td>
</tr>
<tr>
<td class="rank"><span title="Rank">3</span></td>
<td class="any-progression"></td>
<td class="bib"><span title="Bib">247</span></td>
<td class="name">
<a class="10005658114" href="javascript:;">
ZARDINI Edoardo
</a>
</td>
<td class="team"><img alt="VINI ZABU' KTM" src="/Content/images/event/2020/tirreno-adriatico/jerseys/THR.png" title="VINI ZABU' KTM"/></td>
<td class="noc"><img alt="ITA" src="/Content/images/flags/ITA.png" title="ITA"/></td>
<td class="bonif">
</td>
<td class="points" title="Points">2</td>
</tr>
<tr>
<td class="rank"><span title="Rank">4</span></td>
<td class="any-progression"></td>
<td class="bib"><span title="Bib">63</span></td>
<td class="name">
<a class="10003349312" href="javascript:;">
BODNAR Maciej
</a>
</td>
<td class="team"><img alt="BORA - HANSGROHE" src="/Content/images/event/2020/tirreno-adriatico/jerseys/BOH.png" title="BORA - HANSGROHE"/></td>
<td class="noc"><img alt="POL" src="/Content/images/flags/POL.png" title="POL"/></td>
<td class="bonif">
</td>
<td class="points" title="Points">1</td>
</tr>
</tbody>
<tbody>
<tr>
<td class="title" colspan="8">Follonica - km 190</td>
</tr>
<tr>
<td class="rank"><span title="Rank">1</span></td>
<td class="any-progression"></td>
<td class="bib"><span title="Bib">62</span></td>
<td class="name">
<a class="10007738055" href="javascript:;">
ACKERMANN Pascal
</a>
</td>
<td class="team"><img alt="BORA - HANSGROHE" src="/Content/images/event/2020/tirreno-adriatico/jerseys/BOH.png" title="BORA - HANSGROHE"/></td>
<td class="noc"><img alt="GER" src="/Content/images/flags/GER.png" title="GER"/></td>
<td class="bonif">
</td>
<td class="points" title="Points">12</td>
</tr>
<tr>
<td class="rank"><span title="Rank">2</span></td>
<td class="any-progression"></td>
<td class="bib"><span title="Bib">231</span></td>
<td class="name">
<a class="10008656828" href="javascript:;">
GAVIRIA RENDON Fernando
</a>
</td>
<td class="team"><img alt="UAE TEAM EMIRATES" src="/Content/images/event/2020/tirreno-adriatico/jerseys/UAD.png" title="UAE TEAM EMIRATES"/></td>
<td class="noc"><img alt="COL" src="/Content/images/flags/COL.png" title="COL"/></td>
<td class="bonif">
</td>
<td class="points" title="Points">10</td>
</tr>
<tr>
<td class="rank"><span title="Rank">3</span></td>
<td class="any-progression"></td>
<td class="bib"><span title="Bib">137</span></td>
<td class="name">
<a class="10007506366" href="javascript:;">
ZABEL Rick
</a>
</td>
<td class="team"><img alt="ISRAEL START - UP NATION" src="/Content/images/event/2020/tirreno-adriatico/jerseys/ISN.png" title="ISRAEL START - UP NATION"/></td>
<td class="noc"><img alt="GER" src="/Content/images/flags/GER.png" title="GER"/></td>
<td class="bonif">
</td>
<td class="points" title="Points">8</td>
</tr>
<tr>
<td class="rank"><span title="Rank">4</span></td>
<td class="any-progression"></td>
<td class="bib"><span title="Bib">91</span></td>
<td class="name">
<a class="10008661777" href="javascript:;">
BALLERINI Davide
</a>
</td>
<td class="team"><img alt="DECEUNINCK - QUICK - STEP " src="/Content/images/event/2020/tirreno-adriatico/jerseys/DQT.png" title="DECEUNINCK - QUICK - STEP "/></td>
<td class="noc"><img alt="ITA" src="/Content/images/flags/ITA.png" title="ITA"/></td>
<td class="bonif">
</td>
<td class="points" title="Points">7</td>
</tr>
<tr>
<td class="rank"><span title="Rank">5</span></td>
<td class="any-progression"></td>
<td class="bib"><span title="Bib">12</span></td>
<td class="name">
<a class="10007096239" href="javascript:;">
MERLIER Tim
</a>
</td>
<td class="team"><img alt="ALPECIN - FENIX" src="/Content/images/event/2020/tirreno-adriatico/jerseys/AFC.png" title="ALPECIN - FENIX"/></td>
<td class="noc"><img alt="BEL" src="/Content/images/flags/BEL.png" title="BEL"/></td>
<td class="bonif">
</td>
<td class="points" title="Points">6</td>
</tr>
<tr>
<td class="more" colspan="8"><a href="javascript:;">More...</a></td>
</tr>
<tr style="display: none;">
<td class="rank"><span title="Rank">6</span></td>
<td class="any-progression"></td>
<td class="bib"><span title="Bib">133</span></td>
<td class="name">
<a class="10028417041" href="javascript:;">
CIMOLAI Davide
</a>
</td>
<td class="team"><img alt="ISRAEL START - UP NATION" src="/Content/images/event/2020/tirreno-adriatico/jerseys/ISN.png" title="ISRAEL START - UP NATION"/></td>
<td class="noc"><img alt="ITA" src="/Content/images/flags/ITA.png" title="ITA"/></td>
<td class="bonif">
</td>
<td class="points" title="Points">5</td>
</tr>
<tr style="display: none;">
<td class="rank"><span title="Rank">7</span></td>
<td class="any-progression"></td>
<td class="bib"><span title="Bib">213</span></td>
<td class="name">
<a class="10007216275" href="javascript:;">
MANZIN Lorrenzo
</a>
</td>
<td class="team"><img alt="TOTAL DIRECT ENERGIE" src="/Content/images/event/2020/tirreno-adriatico/jerseys/TDE.png" title="TOTAL DIRECT ENERGIE"/></td>
<td class="noc"><img alt="FRA" src="/Content/images/flags/FRA.png" title="FRA"/></td>
<td class="bonif">
</td>
<td class="points" title="Points">4</td>
</tr>
<tr style="display: none;">
<td class="rank"><span title="Rank">8</span></td>
<td class="any-progression"></td>
<td class="bib"><span title="Bib">23</span></td>
<td class="name">
<a class="10007744624" href="javascript:;">
PACIONI Luca
</a>
</td>
<td class="team"><img alt="ANDRONI GIOCATTOLI - SIDERMEC" src="/Content/images/event/2020/tirreno-adriatico/jerseys/ANS.png" title="ANDRONI GIOCATTOLI - SIDERMEC"/></td>
<td class="noc"><img alt="ITA" src="/Content/images/flags/ITA.png" title="ITA"/></td>
<td class="bonif">
</td>
<td class="points" title="Points">3</td>
</tr>
<tr style="display: none;">
<td class="rank"><span title="Rank">9</span></td>
<td class="any-progression"></td>
<td class="bib"><span title="Bib">147</span></td>
<td class="name">
<a class="10010946028" href="javascript:;">
<abbr title="Young rider">*</abbr>
VERMEERSCH Florian
</a>
</td>
<td class="team"><img alt="LOTTO SOUDAL" src="/Content/images/event/2020/tirreno-adriatico/jerseys/LTS.png" title="LOTTO SOUDAL"/></td>
<td class="noc"><img alt="BEL" src="/Content/images/flags/BEL.png" title="BEL"/></td>
<td class="bonif">
</td>
<td class="points" title="Points">2</td>
</tr>
<tr style="display: none;">
<td class="rank"><span title="Rank">10</span></td>
<td class="any-progression"></td>
<td class="bib"><span title="Bib">195</span></td>
<td class="name">
<a class="10006631548" href="javascript:;">
TEUNISSEN Mike
</a>
</td>
<td class="team"><img alt="JUMBO - VISMA" src="/Content/images/event/2020/tirreno-adriatico/jerseys/TJV.png" title="JUMBO - VISMA"/></td>
<td class="noc"><img alt="NED" src="/Content/images/flags/NED.png" title="NED"/></td>
<td class="bonif">
</td>
<td class="points" title="Points">1</td>
</tr>
</tbody>
</table>
</div>