0

I'm trying to extract table data from a couple thousand html files or site data, but the tables don't have divs to make this easy and I'm pretty new to beautiful soup. Right now I'm manually editing all the converted html to csv and dropping them in my db to create tables, but I'd rather just scrape what I already have.

<
<body style="margin-top:140px;">    
<div id="container">
 <!-- Left div -->
 <div>
  &nbsp;
 </div>
 <!-- Center div -->
 <div>
  <!-- Image Link -->
  <a href="http://www.website.com"><img src="http://website.com/wp-content/uploads/2016/12/Blue-Transparent.png" style = "max-width:100%; max-height:120px;" alt="Center Banner"></a>
 </div>
 <!-- Right div -->
 <div>
  &nbsp;
 </div>
</div>
<A Name = "Top"></A>
<H1>5k Run</H1>
<H1>Overall Finish List</H1>
<H2>September 24, 2022</H2>
<HR noshade>
<B><I> </I></B>
<HR noshade>
<table border=0 cellpadding=0 cellspacing=0 class="racetable">
  <tr>
    <td class=h01 colspan="9"><H2>1st Alarm 5k</H2></td>
  </tr>
  <tr>
    <td class=h11>Place</td>
    <td class=h12>Name</td>
    <td class=h12>City</td>
    <td class=h11>Bib No</td>
    <td class=h11>Age</td>
    <td class=h11>Gender</td>
    <td class=h11>Age Group</td>
    <td class=h11>Total Time</td>
    <td class=h11>Pace</td>
  </tr>
  <tr>
    <td class=d01>1</td>
    <td class=d02>Runner 1</td>
    <td class=d02>ANYTOWN  PA</td>
    <td class=d01>390</td>
    <td class=d01>52</td>
    <td class=d01>M</td>
    <td class=d01>1:Overall</td>
    <td class=d01>   18:43.93</td>
    <td class=d01>6:03/M</td>
  </tr>
  <tr>
    <td class=d01>2</td>
    <td class=d02>Runner 2</td>
    <td class=d02>ANYTOWN  PA</td>
    <td class=d01>380</td>
    <td class=d01>33</td>
    <td class=d01>M</td>
    <td class=d01>1:19-39</td>
    <td class=d01>   19:31.27</td>
    <td class=d01>6:18/M</td>
  </tr>
  <tr>
    <td class=d01>3</td>
    <td class=d02>Runner 3</td>
    <td class=d02>ANYTOWN  PA</td>
    <td class=d01>389</td>
    <td class=d01>65</td>
    <td class=d01>F</td>
    <td class=d01>1:Overall</td>
    <td class=d01>   45:45.20</td>
    <td class=d01>14:46/M</td>
  </tr>
  <tr>
    <td class=d01>4</td>
    <td class=d02>Runner 4</td>
    <td class=d02>ANYTOWN  PA</td>
    <td class=d01>381</td>
    <td class=d01>18</td>
    <td class=d01>F</td>
    <td class=d01>1: 1-18</td>
    <td class=d01>   53:28.84</td>
    <td class=d01>17:15/M</td>
  </tr>
  <tr>
    <td class=d01>5</td>
    <td class=d02>Runner 5</td>
    <td class=d02>ANYTOWN  PA</td>
    <td class=d01>382</td>
    <td class=d01>41</td>
    <td class=d01>F</td>
    <td class=d01>1:40-59</td>
    <td class=d01>   53:30.48</td>
    <td class=d01>17:16/M</td>
  </tr>
  <tr>
    <td class=d01>6</td>
    <td class=d02>Runner 6</td>
    <td class=d02>ANYTOWN  PA</td>
    <td class=d01>384</td>
    <td class=d01>14</td>
    <td class=d01>M</td>
    <td class=d01>1: 1-18</td>
    <td class=d01>   57:38.66</td>
    <td class=d01>18:36/M</td>
  </tr>
  <tr>
    <td class=d01>7</td>
    <td class=d02>Runner 7</td>
    <td class=d02>ANYTOWN  PA</td>
    <td class=d01>385</td>
    <td class=d01>72</td>
    <td class=d01>F</td>
    <td class=d01>1:60-99</td>
    <td class=d01>   57:40.11</td>
    <td class=d01>18:36/M</td>
  </tr>
</table>
 
<HR noshade>
<p>
<!-- 0c17  22.0 2e9 -->
</BODY>
</HTML>
>

I've tried adding divs, but haven't had much success.

  • Does this answer your question? [python BeautifulSoup parsing table](https://stackoverflow.com/questions/23377533/python-beautifulsoup-parsing-table) – JonSG Jan 04 '23 at 16:59

2 Answers2

0

For your problem, probably using pandas would be more helpful then, doing data scraping (Assuming: You are only interested in parsing a table from a webpage) .

Here's the simple code to solve your problem:

import pandas as pd # pip install pandas

# Full HTML Page
html="""
<body style="margin-top:140px;">    
<div id="container">
 <!-- Left div -->
......
</table>
<HR noshade>
<p>
<!-- 0c17  22.0 2e9 -->
</BODY>
</HTML>
"""

df=pd.read_html(html)[0] #[0] for taking first table from html

# Got some unnecessary row
df=df.iloc[1:,:] # Removing Unnessary Row with "1st Alarm 5k" data
new_header = df.iloc[0] # Grab the first row for the header
df = df[1:] # Take the data less the header row
df.columns = new_header # Set the header row as the df header

print(df)

# Saving this dataframe to .csv
df.to_csv("[].csv",index=False) # Your HTML page already has Index Column "Place"

For more information about pandas visit documentation page

Some helpful links for above code:

  1. pandas.read_html()
  2. pandas.Series.iloc[]
  3. pandas.read_html()

Update:
OP asks for "I would like to be able to pull all files from a directory and make them one table".

For this problem we can use glob to select all HTML file from given directory, make DataFrame of single file and using pandas.concat() with previous concatenated table...
Code:

import pandas as pd
import glob

def change_header(df):
    new_header = df.iloc[0] # Grab the first row for the header
    df = df[1:] # Take the data less the header row
    df.columns = new_header # Set the header row as the df header
    return df


whole=pd.DataFrame() # Creating empty DataFrame for initial concatenation

for x in glob.glob("/HTML/*.html"): # Path to directory, x gives .html files names
    with open(x,"r",encoding='utf-8') as f:
        df=pd.read_html(f,skiprows=1)[0]
    df=change_header(df)
    whole=pd.concat([whole,df],axis=0)

print(whole)

# Saving this dataframe to .csv
whole.to_csv("[].csv",index=False) # Your HTML page already has Index Column "Place"
imxitiz
  • 3,920
  • 3
  • 9
  • 33
  • Thanks @imxitiz. This helps, but I would like to be able to pull all files from a directory and make them one table. I tried glob, but no luck. Also, what if all the tables don't have the same headings? Will pandas just create new? – rlandis8yahoocom Jan 05 '23 at 15:46
  • Sorry, but I didn't understand "_I would like to be able to pull all files from a directory and make them one table_" – imxitiz Jan 05 '23 at 16:01
  • I have a whole folder of multiple htm files. I would like to combine them into one table. I'm currently using a DB, but the software exports to an html report. I'm trying to make things have as little user interaction as possible to make it work. – rlandis8yahoocom Jan 05 '23 at 22:18
  • "_What if all the tables don't have the same headings? Will pandas just create new?_" Can you please expand on it? – imxitiz Jan 06 '23 at 13:42
  • Traceback (most recent call last): File "C:\xampp\htdocs\racetimes\scrape2.py", line 17, in whole=pd.concat([new,df],axis=0) NameError: name 'new' is not defined I think I've bit off more than I can chew. Also, I mean I have a folder or HTM files I want to combine into one for this. – rlandis8yahoocom Jan 06 '23 at 15:54
  • Sorry, my bad. And for your folder you may need to write `glob.glob("HTML/* .htm")` if your HTM files are in *C:\xampp\htdocs\racetimes\HTML* folder. – imxitiz Jan 06 '23 at 23:39
0

BeautifulSoup allows you to search on something other than div.

Let's say with the html you show that you want to retrieve what looks like runners, you could do something like this.

from bs4 import BeautifulSoup

file_path = 'scrap.html'

with open(file_path, 'r',
          encoding='utf-8') as file:  # We simulate a return from an html request by just opening an .html file
    html_content = file.read()

soup = BeautifulSoup(html_content, 'html.parser')
table = soup.find('table', {"class": "racetable"})  # We are looking for the table with the 'racetable' class
rows_table = table.find_all('tr')[1:]  # All lines in the table without the first one

columns_name = [
    row.get_text() for row in rows_table[0].find_all('td')
]  # We get the name of each column in a list

runners = []
for row in rows_table[1:]:  # We repeat on all the lines except the first one which is the one with the name of the columns
    data = [
        elem.get_text().strip() for elem in row.find_all('td')
    ]
    runner = {
        "place": data[columns_name.index("Place")],
        "name": data[columns_name.index("Name")],
        "city": data[columns_name.index("City")],
        "bib_no": data[columns_name.index("Bib No")],
        "age": data[columns_name.index("Age")],
        "gender": data[columns_name.index("Gender")],
        "age_group": data[columns_name.index("Age Group")],
        "total_time": data[columns_name.index("Total Time")],
        "pace": data[columns_name.index("Pace")]
    }
    print(runner)
    runners.append(runner)

the result of the print would look like this

{'place': '1', 'name': 'Runner 1', 'city': 'ANYTOWN  PA', 'bib_no': '390', 'age': '52', 'gender': 'M', 'age_group': '1:Overall', 'total_time': '18:43.93', 'pace': '6:03/M'}
{'place': '2', 'name': 'Runner 2', 'city': 'ANYTOWN  PA', 'bib_no': '380', 'age': '33', 'gender': 'M', 'age_group': '1:19-39', 'total_time': '19:31.27', 'pace': '6:18/M'}
{'place': '3', 'name': 'Runner 3', 'city': 'ANYTOWN  PA', 'bib_no': '389', 'age': '65', 'gender': 'F', 'age_group': '1:Overall', 'total_time': '45:45.20', 'pace': '14:46/M'}
{'place': '4', 'name': 'Runner 4', 'city': 'ANYTOWN  PA', 'bib_no': '381', 'age': '18', 'gender': 'F', 'age_group': '1: 1-18', 'total_time': '53:28.84', 'pace': '17:15/M'}
{'place': '5', 'name': 'Runner 5', 'city': 'ANYTOWN  PA', 'bib_no': '382', 'age': '41', 'gender': 'F', 'age_group': '1:40-59', 'total_time': '53:30.48', 'pace': '17:16/M'}
{'place': '6', 'name': 'Runner 6', 'city': 'ANYTOWN  PA', 'bib_no': '384', 'age': '14', 'gender': 'M', 'age_group': '1: 1-18', 'total_time': '57:38.66', 'pace': '18:36/M'}
{'place': '7', 'name': 'Runner 7', 'city': 'ANYTOWN  PA', 'bib_no': '385', 'age': '72', 'gender': 'F', 'age_group': '1:60-99', 'total_time': '57:40.11', 'pace': '18:36/M'}
  • Thanks for the help. This simply gives me back errors of not seeing the column names. Traceback (most recent call last): File "C:\xampp\htdocs\racetimes\scrape.py", line 24, in "name": data[columns_name.index("Name")], ValueError: 'Name' is not in list Traceback (most recent call last): File "C:\xampp\htdocs\racetimes\scrape.py", line 23, in "place": data[columns_name.index("Place")], ValueError: 'Place' is not in list – rlandis8yahoocom Jan 05 '23 at 14:59
  • The content of the HTML you are trying to scrape should not be exactly the same as the one you showed above in this case. Your error means that there is no string "Name" and "Place" in the columns_names list ( which I use to retrieve the content of the second pair of tags which is the name of each column in your example) – Vincent Lagache Jan 05 '23 at 15:33
  • Vincent, it's my mistake. Every 100 places it puts in another header row. Is there a way to ignore that? So, I would have of place 100, then a repeat of the headers with a colspan. 100.....   Place Name City Bib No Age Gender Age Group Total Time Pace 101 – rlandis8yahoocom Jan 05 '23 at 22:07