python selenium scraping tbody

Question

Below is the HTML I'm trying to scrape:

<div class="data-point-container section-break">
    <!-- some other div elements here which I don't need -->
    <table class data-bind="showHidden: isData">
          <!-- ko foreach : sections -->
        <thead>...</thead>
        <tbody>...</tbody>
        <thead>...</thead>
        <tbody>...</tbody>
        <thead>...</thead>
        <tbody>...</tbody>
        <thead>...</thead>
        <tbody>...</tbody>
        <thead>...</thead>
        <tbody>...</tbody>
          <!-- /ko -->
    </table>
</div>

How do I use pandas.read_html to scrape all of this information, with each thead as the headers and each tbody as the values?

EDIT:

This is the site I'm trying to scrape; I want the data extracted into a pandas DataFrame. Link here

This actually violates the spec, you cannot have multiple `thead` or `tfoot` elements in a `table`: http://stackoverflow.com/a/16155425/771848. — alecxe, Jul 17 '16 at 02:58
Could you post the complete table? - at least with some `thead` and `tbody` expanded.. — alecxe, Jul 17 '16 at 02:58
Hi alecxe, i've added the link of what i'm trying to scrape. There's too much HTML code for me to put in stackoverflow so I thought it might be better to just show you what data am I trying to get.. — jake wong, Jul 17 '16 at 03:12

score 2 · Accepted Answer · edited May 23 '17 at 12:14

Strictly speaking, one should not have more than one thead element per table according to the table element specification.

If you still have this structure of a thead followed by its corresponding tbody, I would parse it iteratively, turning each thead/tbody pair into its own dataframe.

Working example:

from io import StringIO

import pandas as pd
from bs4 import BeautifulSoup

data = """
<div class="data-point-container section-break">
    <table class data-bind="showHidden: isData">

        <thead>
            <tr><th>Customer</th><th>Order</th><th>Month</th></tr>
        </thead>
        <tbody>
            <tr><td>Customer 1</td><td>#1</td><td>January</td></tr>
            <tr><td>Customer 2</td><td>#2</td><td>April</td></tr>
            <tr><td>Customer 3</td><td>#3</td><td>March</td></tr>
        </tbody>

        <thead>
            <tr><th>Customer</th></tr>
        </thead>
        <tbody>
            <tr><td>Customer 4</td></tr>
            <tr><td>Customer 5</td></tr>
            <tr><td>Customer 6</td></tr>
        </tbody>

    </table>
</div>
"""

soup = BeautifulSoup(data, "html.parser")

# Treat every thead and the tbody that follows it as a separate table.
for thead in soup.select(".data-point-container table thead"):
    tbody = thead.find_next_sibling("tbody")

    # Wrap the pair in its own <table> so pandas sees one valid table.
    table = "<table>%s</table>" % (str(thead) + str(tbody))

    # Recent pandas versions deprecate passing literal HTML strings,
    # so hand read_html a file-like object instead.
    df = pd.read_html(StringIO(table))[0]
    print(df)
    print("-----")

This prints two dataframes, one for each thead/tbody pair in the sample input HTML:

     Customer Order    Month
0  Customer 1    #1  January
1  Customer 2    #2    April
2  Customer 3    #3    March
-----
     Customer
0  Customer 4
1  Customer 5
2  Customer 6
-----

Note that I've intentionally made the number of header and data cells different in every block for demonstration purposes.

I've updated my question for your reference. Thanks for helping me with this! — jake wong, Jul 17 '16 at 03:13
@jakewong sure, please try this solution anyway. It might just work as is. — alecxe, Jul 17 '16 at 03:14
Yes i've just tried it and it worked like a charm! You're a legend! :) — jake wong, Jul 17 '16 at 03:18
sorry to ask another question, but is there a way, to separate the dataframes into 2 variables? eg: `df1, df2` I'm having difficulties putting it in one variable, per dataframe — jake wong, Aug 06 '16 at 10:34
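
One possible way to get separate variables, sketched under the assumption that the number of thead/tbody pairs is known in advance: collect the dataframes in a list and unpack it. The helper name `split_sections` and the trimmed sample markup below are illustrative only and are not part of the original answer.

```python
from io import StringIO

import pandas as pd
from bs4 import BeautifulSoup

# Trimmed-down sample markup (an assumption, modelled on the answer's example).
SAMPLE_HTML = """
<div class="data-point-container section-break">
    <table>
        <thead><tr><th>Customer</th><th>Order</th><th>Month</th></tr></thead>
        <tbody>
            <tr><td>Customer 1</td><td>#1</td><td>January</td></tr>
            <tr><td>Customer 2</td><td>#2</td><td>April</td></tr>
        </tbody>
        <thead><tr><th>Customer</th></tr></thead>
        <tbody>
            <tr><td>Customer 4</td></tr>
            <tr><td>Customer 5</td></tr>
            <tr><td>Customer 6</td></tr>
        </tbody>
    </table>
</div>
"""


def split_sections(html):
    """Return one DataFrame per thead/tbody pair (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    frames = []
    for thead in soup.select(".data-point-container table thead"):
        tbody = thead.find_next_sibling("tbody")
        if tbody is None:
            # A header with no body after it has nothing to parse.
            continue
        # Rebuild a small, valid table from just this pair.
        fragment = "<table>" + str(thead) + str(tbody) + "</table>"
        frames.append(pd.read_html(StringIO(fragment))[0])
    return frames


frames = split_sections(SAMPLE_HTML)

# Unpacking works only when the number of sections is known up front.
df1, df2 = frames
print(df1)
print(df2)
```

If the number of sections can vary, it is probably safer to keep the list and index into it (`frames[0]`, `frames[1]`, ...) than to unpack into fixed names.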
