Below is the HTML I'm trying to scrape:

<div class="data-point-container section-break">
    <!-- some other div elements here which I don't need -->
    <table class data-bind="showHidden: isData">
          <!-- ko foreach : sections -->
        <thead>...</thead>
        <tbody>...</tbody>
        <thead>...</thead>
        <tbody>...</tbody>
        <thead>...</thead>
        <tbody>...</tbody>
        <thead>...</thead>
        <tbody>...</tbody>
        <thead>...</thead>
        <tbody>...</tbody>
          <!-- /ko -->
    </table>
</div>

How do I use pandas.read_html to scrape all of this information, with each thead as the headers and each tbody as the values?

EDIT:

This is the site I'm trying to scrape; I want the data extracted into a pandas DataFrame. Link here

jake wong
  • This actually violates the spec, you cannot have multiple `thead` or `tfoot` elements in a `table`: http://stackoverflow.com/a/16155425/771848. – alecxe Jul 17 '16 at 02:58
  • Could you post the complete table, at least with some `thead` and `tbody` elements expanded? – alecxe Jul 17 '16 at 02:58
  • Hi alecxe, I've added the link to what I'm trying to scrape. There's too much HTML for me to put on Stack Overflow, so I thought it might be better to just show you what data I'm trying to get. – jake wong Jul 17 '16 at 03:12

1 Answer

Strictly speaking, one should not have more than one thead element per table according to the table element specification.

If you still have this structure, where each thead is followed by its corresponding tbody, I would parse it iteratively, turning every such pair into its own DataFrame.

Working example:

from io import StringIO

import pandas as pd
from bs4 import BeautifulSoup

data = """
<div class="data-point-container section-break">
    <table class data-bind="showHidden: isData">

        <thead>
            <tr><th>Customer</th><th>Order</th><th>Month</th></tr>
        </thead>
        <tbody>
            <tr><td>Customer 1</td><td>#1</td><td>January</td></tr>
            <tr><td>Customer 2</td><td>#2</td><td>April</td></tr>
            <tr><td>Customer 3</td><td>#3</td><td>March</td></tr>
        </tbody>

        <thead>
            <tr><th>Customer</th></tr>
        </thead>
        <tbody>
            <tr><td>Customer 4</td></tr>
            <tr><td>Customer 5</td></tr>
            <tr><td>Customer 6</td></tr>
        </tbody>

    </table>
</div>
"""

soup = BeautifulSoup(data, "html.parser")

# Pair each <thead> with the <tbody> that follows it and parse the pair as its own table.
for thead in soup.select(".data-point-container table thead"):
    tbody = thead.find_next_sibling("tbody")

    table = "<table>%s</table>" % (str(thead) + str(tbody))

    # Wrap the markup in StringIO: recent pandas versions deprecate passing literal HTML strings.
    df = pd.read_html(StringIO(table))[0]
    print(df)
    print("-----")

This prints two DataFrames, one for each thead/tbody pair in the sample input HTML:

     Customer Order    Month
0  Customer 1    #1  January
1  Customer 2    #2    April
2  Customer 3    #3    March
-----
     Customer
0  Customer 4
1  Customer 5
2  Customer 6
-----

Note that I've intentionally made the number of header and data cells different in every block for demonstration purposes.
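
If you would rather keep the results than print them, the same pairing idea can be wrapped in a function that returns a list of DataFrames. The following is only a sketch: the name `sections_to_frames` and its `table_selector` parameter are made up for illustration, and it builds each frame directly from the cell text instead of calling pd.read_html, so all values stay strings. It also assumes every data row has exactly as many cells as its header row.

```python
import pandas as pd
from bs4 import BeautifulSoup


def sections_to_frames(html, table_selector="table"):
    """Return one DataFrame per <thead>/<tbody> pair (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    frames = []
    for thead in soup.select(f"{table_selector} thead"):
        tbody = thead.find_next_sibling("tbody")
        if tbody is None:
            continue  # a header with no body after it: nothing to pair
        headers = [th.get_text(strip=True) for th in thead.find_all("th")]
        rows = [
            [td.get_text(strip=True) for td in tr.find_all("td")]
            for tr in tbody.find_all("tr")
        ]
        # Assumes each row has as many cells as there are headers.
        frames.append(pd.DataFrame(rows, columns=headers))
    return frames


sample = """
<table>
  <thead><tr><th>Customer</th><th>Order</th></tr></thead>
  <tbody>
    <tr><td>Customer 1</td><td>#1</td></tr>
    <tr><td>Customer 2</td><td>#2</td></tr>
  </tbody>
  <thead><tr><th>Customer</th></tr></thead>
  <tbody>
    <tr><td>Customer 4</td></tr>
  </tbody>
</table>
"""

frames = sections_to_frames(sample)
for frame in frames:
    print(frame)
    print("-----")
```

If a single frame is more convenient, pd.concat(frames, ignore_index=True) stacks the sections and fills the columns a section lacks with NaN.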

alecxe