
I am trying to scrape a table from this site: http://wayback.archive-it.org/7993/20170110233205/http://www.fda.gov/Safety/Recalls/ArchiveRecalls/2015/default.htm

This table has no id or class; it only has summary and width attributes. Is there any way to scrape this table? Perhaps with XPath?

I heard that XPath is not compatible with BeautifulSoup and hope that is wrong.

<table width="100%" cellpadding="3" border="1" summary="Layout showing RecallTest table with 6 columns: Date,Brand Name,Product Description,Reason/Problem,Company,Details/Photo" style="margin-bottom:28px">
          <thead>
            <tr>
                    <th scope="col" data-type="numeric" data-toggle="true"> Date </th>
            </tr>
          </thead>
          <tbody>

Here is my code:

import requests
import pandas as pd
from bs4 import BeautifulSoup

link = 'http://wayback.archive-it.org/7993/20170110233205/http://www.fda.gov/Safety/Recalls/ArchiveRecalls/2015/default.htm'
page = 15
pdf = []
for p in range(1, page+1):
    l = link + '?page=' + str(p)
    # Downloading contents of the web page
    data = requests.get(l).text
    # Creating BeautifulSoup object
    soup = BeautifulSoup(data, 'html.parser')
    tables = soup.find_all('table')
    table = soup.find('table', INSERT XPATH EXPRESSION)
    df = pd.DataFrame(columns=['date', 'brand', 'descr', 'reason', 'company'])
    for row in table.tbody.find_all('tr'):
        # Find all data for each column
        columns = row.find_all('td')
        if columns != []:
            date = columns[0].text.strip()
                    
– rokman54

2 Answers


For scraping tables it is best practice to use pandas.read_html(), which covers 95% of all cases. Simply iterate over the pages and concat the dataframes:

import pandas as pd

url = 'http://wayback.archive-it.org/7993/20170110233205/http://www.fda.gov/Safety/Recalls/ArchiveRecalls/2015/default.htm'

# read the first table of each of the 15 result pages and concat them
pd.concat(
    [pd.read_html(url + '?page=' + str(i))[0] for i in range(1, 16)],
    ignore_index=True
)

Note that you can also extract the links via extract_links='body'; a short sketch follows the output below.

This will result in:

|     | Date       | Brand Name                                | Product Description                                | Reason/Problem           | Company                  | Details/Photo |
|-----|------------|-------------------------------------------|----------------------------------------------------|--------------------------|--------------------------|---------------|
| 0   | 12/31/2015 | PharMEDium                                | Norepinephrine Bitartrate added to Sodium Chloride | Discoloration            | PharMEDium Services, LLC | nan           |
| 1   | 12/31/2015 | Thomas Produce                            | Cucumbers                                          | Salmonella               | Thomas Produce Company   | nan           |
| 2   | 12/28/2015 | Wegmans, Uoriki Fresh                     | Octopus Salad                                      | Listeria monocytogenes   | Uoriki Fresh, Inc.       | nan           |
| ... | ...        | ...                                       | ...                                                | ...                      | ...                      | ...           |
| 433 | 01/05/2015 | Whole Foods Market                        | Assorted cookie platters                           | Undeclared tree nuts     | Whole Foods Market       | nan           |
| 434 | 01/05/2015 | Eillien's, Blain's Farms and Fleet & more | Walnut Pieces                                      | Salmonella contamination | Eillien’s Candies Inc.   | nan           |
| 435 | 01/02/2015 | Full Tilt Ice Cream                       | Ice Cream                                          | Listeria monocytogenes   | Full Tilt Ice Cream      | nan           |
| 436 | 01/02/2015 | Zilks                                     | Hummus                                             | Undeclared peanuts       | Zilks Foods              | nan           |
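
For the extract_links option mentioned above, a minimal sketch (assuming pandas >= 1.5; with extract_links='body' every cell in the table body comes back as a tuple of text and link):

import pandas as pd

url = 'http://wayback.archive-it.org/7993/20170110233205/http://www.fda.gov/Safety/Recalls/ArchiveRecalls/2015/default.htm'

# each body cell becomes a (text, href-or-None) tuple,
# so the hrefs in the Details/Photo column survive
df = pd.read_html(url + '?page=1', extract_links='body')[0]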

Following your manual approach, simply select the first table, iterate over the rows, and store the information in a list of dicts, which can easily be converted into a dataframe:

import requests
from bs4 import BeautifulSoup

url = 'http://wayback.archive-it.org/7993/20170110233205/http://www.fda.gov/Safety/Recalls/ArchiveRecalls/2015/default.htm'

data = []

for i in range(1, 16):
    soup = BeautifulSoup(requests.get(url + '?page=' + str(i)).text, 'html.parser')
    # iterate only over rows that contain data cells, skipping the header row
    for e in soup.table.select('tr:has(td)'):
        data.append({
            'date': e.td.text,
            'any other': 'column',
            'link': e.a.get('href') if e.a else None
        })

data
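
The list of dicts can then be turned into the dataframe mentioned above (assuming pandas is imported as pd):

import pandas as pd

# each dict becomes one row; the keys become the column names
df = pd.DataFrame(data)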
– HedgeHog

It takes just two lines of code to get the table from the website:

import requests
import pandas as pd


url = "http://wayback.archive-it.org/7993/20170110233205/http://www.fda.gov/Safety/Recalls/ArchiveRecalls/2015/default.htm"
# parse every <table> in the downloaded HTML into a list of dataframes
tables = pd.read_html(requests.get(url).text)

print(tables[0])

You have to use two modules, requests and pandas.
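
As a side note, requests is not strictly required here, since pandas.read_html can also fetch a URL directly; a minimal sketch:

import pandas as pd

url = "http://wayback.archive-it.org/7993/20170110233205/http://www.fda.gov/Safety/Recalls/ArchiveRecalls/2015/default.htm"

# read_html downloads the page itself and returns a list of dataframes
tables = pd.read_html(url)
print(tables[0])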

You can read more about the pandas.read_html function in the pandas documentation.

– Devam Sanghvi