1

I've got a page I'm scraping and most of the tables are in the format Heading --info. I can iterate through most of the tables and create separate dataframes for all the various information using pandas.read_html.

However, there are some where they've combined information into one table with subheadings that I want to be separate dataframes with the text of that row as the heading (appending a list).

Is there an easy way to split this dataframe - It will always be heading followed by associated rows, new heading followed by new associated rows.

eg.

   Col1   Col2
0  thing   
1  1       2
2  2       3
3  thing2  
4  1       2
5  2       3
6  3       4

Should be

thing
1   1   1
2   2   2

thing2
4   1   2
5   2   3
6   3   4

It'd be nice if people would just create web pages that made sense with the data but that's not the case here.

I've tried iterrows but cannot seem to come up with a good way to create what I want.

Help would be very much appreciated!

<div class="ranking">
    <h6><a href="javascript:;">Sprint</a></h6>
    <table>
    <tbody>
    
    
    </tbody>
    <tbody>
    <tr>
     <td class="title" colspan="8">Canneto - km 137</td>
    </tr>
    
    <tr>
    <td class="rank"><span title="Rank">1</span></td>
    <td class="any-progression"></td>
    <td class="bib"><span title="Bib">21</span></td>
    <td class="name">
    <a class="10010085859" href="javascript:;">
    <abbr title="Young rider">*</abbr>
    BAGIOLI Nicola
    </a>
    
    </td>
    <td class="team"><img alt="ANDRONI GIOCATTOLI - SIDERMEC" src="/Content/images/event/2020/tirreno-adriatico/jerseys/ANS.png" title="ANDRONI GIOCATTOLI - SIDERMEC"/></td>
    
    <td class="noc"><img alt="ITA" src="/Content/images/flags/ITA.png" title="ITA"/></td>
    
    <td class="bonif">
    </td>
    <td class="points" title="Points">5</td>
    
    </tr>
    <tr>
    <td class="rank"><span title="Rank">2</span></td>
    <td class="any-progression"></td>
    <td class="bib"><span title="Bib">54</span></td>
    <td class="name">
    <a class="10008688453" href="javascript:;">
    ORSINI Umberto
    </a>
    
    </td>
    <td class="team"><img alt="BARDIANI CSF FAIZANE'" src="/Content/images/event/2020/tirreno-adriatico/jerseys/BCF.png" title="BARDIANI CSF FAIZANE'"/></td>
    
    <td class="noc"><img alt="ITA" src="/Content/images/flags/ITA.png" title="ITA"/></td>
    
    <td class="bonif">
    </td>
    <td class="points" title="Points">3</td>
    
    </tr>
    <tr>
    <td class="rank"><span title="Rank">3</span></td>
    <td class="any-progression"></td>
    <td class="bib"><span title="Bib">247</span></td>
    <td class="name">
    <a class="10005658114" href="javascript:;">
    ZARDINI Edoardo
    </a>
    
    </td>
    <td class="team"><img alt="VINI ZABU' KTM" src="/Content/images/event/2020/tirreno-adriatico/jerseys/THR.png" title="VINI ZABU' KTM"/></td>
    
    <td class="noc"><img alt="ITA" src="/Content/images/flags/ITA.png" title="ITA"/></td>
    
    <td class="bonif">
    </td>
    <td class="points" title="Points">2</td>
    
    </tr>
    <tr>
    <td class="rank"><span title="Rank">4</span></td>
    <td class="any-progression"></td>
    <td class="bib"><span title="Bib">63</span></td>
    <td class="name">
    <a class="10003349312" href="javascript:;">
    BODNAR Maciej
    </a>
    
    </td>
    <td class="team"><img alt="BORA - HANSGROHE" src="/Content/images/event/2020/tirreno-adriatico/jerseys/BOH.png" title="BORA - HANSGROHE"/></td>
    
    <td class="noc"><img alt="POL" src="/Content/images/flags/POL.png" title="POL"/></td>
    
    <td class="bonif">
    </td>
    <td class="points" title="Points">1</td>
    
    </tr>
    
    </tbody>
    <tbody>
    <tr>
     <td class="title" colspan="8">Follonica - km 190</td>
    </tr>
    
    <tr>
    <td class="rank"><span title="Rank">1</span></td>
    <td class="any-progression"></td>
    <td class="bib"><span title="Bib">62</span></td>
    <td class="name">
    <a class="10007738055" href="javascript:;">
    ACKERMANN Pascal
    </a>
    
    </td>
    <td class="team"><img alt="BORA - HANSGROHE" src="/Content/images/event/2020/tirreno-adriatico/jerseys/BOH.png" title="BORA - HANSGROHE"/></td>
    
    <td class="noc"><img alt="GER" src="/Content/images/flags/GER.png" title="GER"/></td>
    
    <td class="bonif">
    </td>
    <td class="points" title="Points">12</td>
    
    </tr>
    <tr>
    <td class="rank"><span title="Rank">2</span></td>
    <td class="any-progression"></td>
    <td class="bib"><span title="Bib">231</span></td>
    <td class="name">
    <a class="10008656828" href="javascript:;">
    GAVIRIA RENDON Fernando
    </a>
    
    </td>
    <td class="team"><img alt="UAE TEAM EMIRATES" src="/Content/images/event/2020/tirreno-adriatico/jerseys/UAD.png" title="UAE TEAM EMIRATES"/></td>
    
    <td class="noc"><img alt="COL" src="/Content/images/flags/COL.png" title="COL"/></td>
    
    <td class="bonif">
    </td>
    <td class="points" title="Points">10</td>
    
    </tr>
    <tr>
    <td class="rank"><span title="Rank">3</span></td>
    <td class="any-progression"></td>
    <td class="bib"><span title="Bib">137</span></td>
    <td class="name">
    <a class="10007506366" href="javascript:;">
    ZABEL Rick
    </a>
    
    </td>
    <td class="team"><img alt="ISRAEL START - UP NATION" src="/Content/images/event/2020/tirreno-adriatico/jerseys/ISN.png" title="ISRAEL START - UP NATION"/></td>
    
    <td class="noc"><img alt="GER" src="/Content/images/flags/GER.png" title="GER"/></td>
    
    <td class="bonif">
    </td>
    <td class="points" title="Points">8</td>
    
    </tr>
    <tr>
    <td class="rank"><span title="Rank">4</span></td>
    <td class="any-progression"></td>
    <td class="bib"><span title="Bib">91</span></td>
    <td class="name">
    <a class="10008661777" href="javascript:;">
    BALLERINI Davide
    </a>
    
    </td>
    <td class="team"><img alt="DECEUNINCK  -  QUICK - STEP " src="/Content/images/event/2020/tirreno-adriatico/jerseys/DQT.png" title="DECEUNINCK  -  QUICK - STEP "/></td>
    
    <td class="noc"><img alt="ITA" src="/Content/images/flags/ITA.png" title="ITA"/></td>
    
    <td class="bonif">
    </td>
    <td class="points" title="Points">7</td>
    
    </tr>
    <tr>
    <td class="rank"><span title="Rank">5</span></td>
    <td class="any-progression"></td>
    <td class="bib"><span title="Bib">12</span></td>
    <td class="name">
    <a class="10007096239" href="javascript:;">
    MERLIER Tim
    </a>
    
    </td>
    <td class="team"><img alt="ALPECIN - FENIX" src="/Content/images/event/2020/tirreno-adriatico/jerseys/AFC.png" title="ALPECIN - FENIX"/></td>
    
    <td class="noc"><img alt="BEL" src="/Content/images/flags/BEL.png" title="BEL"/></td>
    
    <td class="bonif">
    </td>
    <td class="points" title="Points">6</td>
    
    </tr>
    <tr>
    <td class="more" colspan="8"><a href="javascript:;">More...</a></td>
    </tr>
    <tr style="display: none;">
    <td class="rank"><span title="Rank">6</span></td>
    <td class="any-progression"></td>
    <td class="bib"><span title="Bib">133</span></td>
    <td class="name">
    <a class="10028417041" href="javascript:;">
    CIMOLAI Davide
    </a>
    
    </td>
    <td class="team"><img alt="ISRAEL START - UP NATION" src="/Content/images/event/2020/tirreno-adriatico/jerseys/ISN.png" title="ISRAEL START - UP NATION"/></td>
    
    <td class="noc"><img alt="ITA" src="/Content/images/flags/ITA.png" title="ITA"/></td>
    
    <td class="bonif">
    </td>
    <td class="points" title="Points">5</td>
    
    </tr>
    <tr style="display: none;">
    <td class="rank"><span title="Rank">7</span></td>
    <td class="any-progression"></td>
    <td class="bib"><span title="Bib">213</span></td>
    <td class="name">
    <a class="10007216275" href="javascript:;">
    MANZIN Lorrenzo
    </a>
    
    </td>
    <td class="team"><img alt="TOTAL DIRECT ENERGIE" src="/Content/images/event/2020/tirreno-adriatico/jerseys/TDE.png" title="TOTAL DIRECT ENERGIE"/></td>
    
    <td class="noc"><img alt="FRA" src="/Content/images/flags/FRA.png" title="FRA"/></td>
    
    <td class="bonif">
    </td>
    <td class="points" title="Points">4</td>
    
    </tr>
    <tr style="display: none;">
    <td class="rank"><span title="Rank">8</span></td>
    <td class="any-progression"></td>
    <td class="bib"><span title="Bib">23</span></td>
    <td class="name">
    <a class="10007744624" href="javascript:;">
    PACIONI Luca
    </a>
    
    </td>
    <td class="team"><img alt="ANDRONI GIOCATTOLI - SIDERMEC" src="/Content/images/event/2020/tirreno-adriatico/jerseys/ANS.png" title="ANDRONI GIOCATTOLI - SIDERMEC"/></td>
    
    <td class="noc"><img alt="ITA" src="/Content/images/flags/ITA.png" title="ITA"/></td>
    
    <td class="bonif">
    </td>
    <td class="points" title="Points">3</td>
    
    </tr>
    <tr style="display: none;">
    <td class="rank"><span title="Rank">9</span></td>
    <td class="any-progression"></td>
    <td class="bib"><span title="Bib">147</span></td>
    <td class="name">
    <a class="10010946028" href="javascript:;">
    <abbr title="Young rider">*</abbr>
    VERMEERSCH Florian
    </a>
    
    </td>
    <td class="team"><img alt="LOTTO SOUDAL" src="/Content/images/event/2020/tirreno-adriatico/jerseys/LTS.png" title="LOTTO SOUDAL"/></td>
    
    <td class="noc"><img alt="BEL" src="/Content/images/flags/BEL.png" title="BEL"/></td>
    
    <td class="bonif">
    </td>
    <td class="points" title="Points">2</td>
    
    </tr>
    <tr style="display: none;">
    <td class="rank"><span title="Rank">10</span></td>
    <td class="any-progression"></td>
    <td class="bib"><span title="Bib">195</span></td>
    <td class="name">
    <a class="10006631548" href="javascript:;">
    TEUNISSEN Mike
    </a>
    
    </td>
    <td class="team"><img alt="JUMBO - VISMA" src="/Content/images/event/2020/tirreno-adriatico/jerseys/TJV.png" title="JUMBO - VISMA"/></td>
    
    <td class="noc"><img alt="NED" src="/Content/images/flags/NED.png" title="NED"/></td>
    
    <td class="bonif">
    </td>
    <td class="points" title="Points">1</td>
    
    </tr>
    
    </tbody>
    </table>
</div>
lweislo
  • 39
  • 7
  • please provide the code you used to generate the first dataframe – Reza Sep 08 '20 at 18:10
  • 1
    how to identify the start of new header? for this sample data it string what if the table data also string how to identify the header? – deadshot Sep 08 '20 at 18:12
  • The code to generate the first dataframe just a beautiful soup scrape of a webpage, that's why it's such a mess. @deadshot - the header will be text but everything else in the table is an integer. But of course because there is text in there pandas things the column is all strings.. Also I made a mistake, the other columns are blank where the header is in column 1. – lweislo Sep 08 '20 at 18:26
  • Pretty broad question. Which part are you having trouble with - identifing the rows with the headers or *splitting* after the rows have been identified? – wwii Sep 08 '20 at 18:30
  • You should probably include the html - a minimal example - as part of your [mre]. It might be worthwhile creating the separate DataFrames from the markup rather than trying to split a *composite* DataFrame. That said, does either [Pandas Split DataFrame using row index](https://stackoverflow.com/questions/53391378/pandas-split-dataframe-using-row-index) or [Python Pandas: Get index of rows which column matches certain value](https://stackoverflow.com/questions/21800169/python-pandas-get-index-of-rows-which-column-matches-certain-value) help or solve your problem? – wwii Sep 08 '20 at 18:43
  • @lweislo did all the columns have data? is there any chance other rows also empty? – deadshot Sep 08 '20 at 18:48
  • As @deadshot asked - if we knew how to identify a header it would help. – wwii Sep 08 '20 at 18:48
  • Sorry @wwii - I have pasted in the html in a code block if it helps any. I looked at those other threads before posting but they didn't quite answer my question. – lweislo Sep 08 '20 at 18:51
  • @deadshot - I started with iterrows, using regex to look for non-numerics to identify the row with the header. I can blank out the information in the columns that way but I need to retain the header information and split the tables up. – lweislo Sep 08 '20 at 19:07
  • @lweislo what is the `Col2` value here `0 thing ` and `3 thing2 ` is it `NaN` or empty string? and after split what is the column name for `Col2`? – deadshot Sep 08 '20 at 19:11
  • @deadshot empty string. The columns are unnamed 1,2,3 – lweislo Sep 08 '20 at 19:21

2 Answers2

1

You can use np.split()

import numpy as np


res = [x.reset_index(drop=True) for x in np.split(df, np.where(df.applymap(lambda x: x == ''))[0]) if not x.empty]
for x in res:
    x = x.rename(columns=x.iloc[0]).drop(x.index[0])
    print(x)

Output:

  thing   
1     1  2
2     2  3
  thing2   
1      1  2
2      2  3
3      3  4
deadshot
  • 8,881
  • 4
  • 20
  • 39
0

Identify the headers and use cumsum() to groupby then append each group to a list.

import pandas as pd
df = pd.DataFrame({'Col1': {0: 'thing', 1: '1', 2: '2', 3: 'thing2', 4: '1',5: '2', 6: '3'},
                   'Col2': {0:'' , 1: 2, 2: 3, 3:'' , 4: 2, 5: 3, 6: 4}})
gb = df.groupby((~df.Col2.astype(bool)).cumsum())


dfs = []
for k,g in gb:
    dfs.append(g.copy())

In [42]: dfs[0]
Out[42]: 
    Col1  Col2
0  thing     
1      1     2
2      2     3

In [43]: dfs[1]
Out[43]: 
     Col1  Col2
3  thing2     
4       1     2
5       2     3
6       3     4
wwii
  • 23,232
  • 7
  • 37
  • 77