I'm trying to figure out how to pull several pieces of information from the https://www.fda.gov/Safety/Recalls/ website.

import requests
from bs4 import BeautifulSoup

res = requests.get("https://www.fda.gov/Safety/Recalls/")
soup = BeautifulSoup(res.text, "lxml")

for item in soup.select("table td"):
    if "Undeclared" in item.text:
        brand = item.find_parents()[0].select("td")[1].text
        reason = item.text
        print(brand,reason)

How do I get the brand_link from the HTML?

  • I suggest that, in the places where you say "recall", you consider whether you mean "recall row" or "table cell" and edit accordingly to clarify. Especially the final bit where you say that your code "pulls all recalls that have :". There are missing words there too. :-) – azhrei Nov 29 '17 at 19:21

1 Answer

I suppose this is what you were after:

import requests
from bs4 import BeautifulSoup

res = requests.get("https://www.fda.gov/Safety/Recalls/")
soup = BeautifulSoup(res.text, "lxml")

for item in soup.select("table td"):
    if "Undeclared" in item.text:
        brand = item.find_parents()[0].select("td")[1].text
        reason = item.text
        print(brand,reason)

Partial Output:

N/A   Undeclared Milk
Colorado Nut Company and various other private labels   Undeclared milk
All Natural, Weis, generic   Undeclared milk
Dilettante Chocolates   Undeclared almonds
Hot Pockets   Undeclared egg, milk, soy, and wheat
Figi’s   Undeclared Milk
Germack   Undeclared Milk

To get the link behind each brand name as well, you can do something like the following:

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

url = "https://www.fda.gov/Safety/Recalls/"
res = requests.get(url)
soup = BeautifulSoup(res.text, "lxml")

for item in soup.select("table td"):
    if "Undeclared" in item.text:
        # find_parents()[0] is the enclosing <tr>; its 2nd <td> holds the brand
        brand_cell = item.find_parents()[0].select("td")[1]
        brand = brand_cell.text
        # the href is site-relative, so resolve it against the page URL
        brand_link = urljoin(url, brand_cell.select("a")[0]['href'])
        reason = item.text
        print("Brand: {}\nBrand_link: {}\nReason: {}\n".format(brand,brand_link,reason))

Output:

Brand: N/A  
Brand_link: https://www.fda.gov/Safety/Recalls/ucm587012.htm
Reason: Undeclared Milk
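
If some brand cells turn out not to contain a link, indexing `select("a")[0]` would raise an `IndexError`. Below is a minimal sketch of a row-based variant that guards against that. It assumes, as the table appears to be laid out, that the brand is in the 2nd `td` and the reason in the 4th. `SAMPLE_HTML`, the `ucm00000x.htm` paths, and the helper name `undeclared_recalls` are made up for illustration, and it uses the built-in `html.parser` so it runs without `lxml`. Unlike the loop above, it only checks the reason column rather than every cell.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

# Hypothetical stand-in for the recalls table; the real markup may differ.
SAMPLE_HTML = """
<table>
  <tr><th>Date</th><th>Brand</th><th>Product</th><th>Reason</th></tr>
  <tr>
    <td>11/29/2017</td>
    <td><a href="/Safety/Recalls/ucm000001.htm">Example Brand</a></td>
    <td>Cookies</td>
    <td>Undeclared milk</td>
  </tr>
  <tr>
    <td>11/28/2017</td>
    <td>No Link Brand</td>
    <td>Candy</td>
    <td>Undeclared almonds</td>
  </tr>
  <tr>
    <td>11/27/2017</td>
    <td><a href="/Safety/Recalls/ucm000002.htm">Other Brand</a></td>
    <td>Salad</td>
    <td>Listeria</td>
  </tr>
</table>
"""

BASE_URL = "https://www.fda.gov/Safety/Recalls/"


def undeclared_recalls(html, base_url):
    """Return (brand, link_or_None, reason) for rows whose reason mentions 'Undeclared'."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for row in soup.select("table tr"):
        cells = row.find_all("td")
        if len(cells) < 4:
            continue  # header rows (th only) or short rows
        reason = cells[3].get_text(strip=True)
        if "Undeclared" not in reason:
            continue
        brand_cell = cells[1]
        anchor = brand_cell.find("a", href=True)
        # Resolve the relative href against the page URL; None if there is no link.
        link = urljoin(base_url, anchor["href"]) if anchor else None
        results.append((brand_cell.get_text(strip=True), link, reason))
    return results


for brand, link, reason in undeclared_recalls(SAMPLE_HTML, BASE_URL):
    print("Brand: {}\nBrand_link: {}\nReason: {}\n".format(brand, link, reason))
```

To try it on the live page, pass `res.text` and `url` from the snippet above instead of the sample values.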
SIM
  • All the desired items in the table are stored in `td` tags. When you write a condition to find the cells that start with `Undeclared`, you may notice that they are in the 4th `td` of each row. The brand name sits in an earlier cell, so you have to step back up to reach it. Since `tr` is the parent of every `td`, once you have located the `Undeclared` cell you go up to its parent `tr` and then select the 2nd `td`, which holds the brand name. Hope that helps. Be sure to mark this as an answer. – SIM Nov 29 '17 at 21:35
  • Awesome! Quick question: The brand name is also a link. If I wanted to grab that link, how would that work? –  Nov 29 '17 at 22:40
  • Try this `brand_link = item.find_parents()[0].select("td")[1].select("a")[0]['href']`. – SIM Nov 29 '17 at 22:45
  • So when I do that, I get back something like this: /Safety/Recalls/ucm586430.htm. Is there any way to get the full link, which would be https://www.fda.gov/Safety/Recalls/ucm583556.htm? –  Nov 29 '17 at 23:01
  • Sure. As soon as I'm near my PC I will let you know. I wrote this from my phone. – SIM Nov 29 '17 at 23:08
  • No problem! I'm thinking there might be a way to prepend https://www.fda.gov to the link? –  Nov 29 '17 at 23:09
  • See the edited part, I've already added the requested portion. – SIM Nov 30 '17 at 04:54
  • Thanks so much!! That is awesome :) –  Nov 30 '17 at 17:39
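
On the relative-link question in the comments: rather than gluing `https://www.fda.gov` onto the string by hand, `urljoin` from the standard library resolves an `href` against the page URL. The small sketch below shows how it treats the three common cases. The `example.com` address is only a placeholder, and the variable names are illustrative.

```python
from urllib.parse import urljoin

page_url = "https://www.fda.gov/Safety/Recalls/"

# Root-relative href (starts with "/"): replaces the path of the page URL.
root_relative = urljoin(page_url, "/Safety/Recalls/ucm586430.htm")

# Plain relative href: resolved against the page URL's directory.
relative = urljoin(page_url, "ucm586430.htm")

# Already-absolute href: returned unchanged.
absolute = urljoin(page_url, "https://example.com/other.htm")

print(root_relative)
print(relative)
print(absolute)
```

Because absolute links pass through untouched, the same call is safe to apply to every `href` in the table.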