Scraping Wikipedia infobox when table cells are in mixed formats

Question

I'm trying to scrape the Wikipedia infobox and get information for some keywords. For example: https://en.wikipedia.org/wiki/A%26W_Root_Beer

Let's say I'm looking for the values for Manufacturer. I want them in a list, and I only want their text. So in this case the desired output would be ['Keurig Dr Pepper (United States, Worldwide)', 'A&W Canada (Canada)']. Whatever I try I can't successfully generate this list. Here is a piece of my code:

url = "https://en.wikipedia.org/wiki/ABC_Studios"
soup = BeautifulSoup(requests.get(url), "lxml")
tbl = soup.find("table", {"class": "infobox vcard"})
list_of_table_rows = tbl.findAll('tr')
for tr in list_of_table_rows:
    th = tr.find("th")
    td = tr.find("td")
    # take th.text and td.text

I would like a method that works in various cases: when there are line breaks in the way, when some of the values are links, when some of the values are paragraphs, etc. In all cases, I only want the text that we see on the screen, not the link, not the paragraph, just plain text. I also don't want the values run together, as in 'Keurig Dr Pepper (United States, Worldwide)A&W Canada (Canada)', because later on I would like to parse the result and do something with each entity.

There are many Wikipedia pages that I'm going through and I can't find a method that works for a good portion of them. Could you help me with working code? I'm not proficient in scraping.

See https://stackoverflow.com/questions/33862336/how-to-extract-information-from-a-wikipedia-infobox — Tgr, Jan 11 '19 at 06:00

imricardoramos · Accepted Answer · 2019-01-10T03:30:05.400

Okay, here's my attempt at doing this (the json library is only to pretty-print the dictionary):

import json

import requests
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/ABC_Studios"
r = requests.get(url)
soup = BeautifulSoup(r.text, "lxml")
tbl = soup.find("table", {"class": "infobox vcard"})

list_of_table_rows = tbl.find_all("tr")
info = {}
for tr in list_of_table_rows:
    th = tr.find("th")
    td = tr.find("td")
    # Skip rows that lack either a header cell or a data cell
    if th is not None and td is not None:
        innerText = ""
        # Walk every node nested inside the cell, at any depth
        for elem in td.descendants:
            if isinstance(elem, str):
                innerText += elem.strip()
            elif elem.name == "br":
                innerText += "\n"
        info[th.text] = innerText

print(json.dumps(info, indent=1))

The code replaces the <br/> tags with \n, which gives:

{
 "Trading name": "ABC Studios",
 "Type": "Subsidiary\nLimited liability company",
 "Industry": "Television production",
 "Predecessor": "Touchstone Television",
 "Founded": "March\u00a021, 1985; 33 years ago(1985-03-21)",
 "Headquarters": "Burbank, California,U.S.",
 "Area served": "Worldwide",
 "Key people": "Patrick Moran (President)",
 "Parent": "ABC Entertainment Group\n(Disney\u2013ABC Television Group)",
 "Website": "abcstudios.go.com"
}

You can tweak it if you want lists instead of strings containing \n. Replace the info[th.text] = innerText line inside the loop with:

    innerTextList = innerText.split("\n")
    if len(innerTextList) < 2:
        info[th.text] = innerTextList[0]
    else:
        info[th.text] = innerTextList

Which gives:

{
 "Trading name": "ABC Studios",
 "Type": [
  "Subsidiary",
  "Limited liability company"
 ],
 "Industry": "Television production",
 "Predecessor": "Touchstone Television",
 "Founded": "March\u00a021, 1985; 33 years ago(1985-03-21)",
 "Headquarters": "Burbank, California,U.S.",
 "Area served": "Worldwide",
 "Key people": "Patrick Moran (President)",
 "Parent": [
  "ABC Entertainment Group",
  "(Disney\u2013ABC Television Group)"
 ],
 "Website": "abcstudios.go.com"
}
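Note that stripping every text fragment separately also drops the spaces between neighbouring fragments, which is why the output above shows "California,U.S.". A possible variation, shown here only as a minimal sketch, is to replace each <br/> with a newline first and strip once per line afterwards. The SAMPLE_ROW markup and the cell_values helper below are hypothetical stand-ins rather than anything taken from Wikipedia, and "html.parser" is used only so the sketch runs without lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical sample standing in for one infobox row; real pages vary.
SAMPLE_ROW = """
<table class="infobox vcard"><tr>
<th>Manufacturer</th>
<td><a href="/wiki/Keurig_Dr_Pepper">Keurig Dr Pepper</a> (United States, Worldwide)<br/>
<a href="/wiki/A%26W_(Canada)">A&amp;W Canada</a> (Canada)</td>
</tr></table>
"""


def cell_values(td):
    """Return the cell's visible text as a list, one item per <br/>-separated line.

    Note: this modifies the parsed tree by replacing the <br/> tags in place.
    """
    for br in td.find_all("br"):
        br.replace_with("\n")  # turn each line break into a real newline
    # Strip once per line, so spaces between a link and its trailing text survive
    lines = (line.strip() for line in td.get_text().split("\n"))
    return [line for line in lines if line]  # drop empty lines


soup = BeautifulSoup(SAMPLE_ROW, "html.parser")
row = soup.find("tr")
values = cell_values(row.find("td"))
print(values)
```

On this sample the sketch prints ['Keurig Dr Pepper (United States, Worldwide)', 'A&W Canada (Canada)']. Cells that separate values with list items or paragraphs instead of <br/> would need their own handling.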

Your answer was close to what I wanted. Thanks! — Tapal Goosal, Jan 10 '19 at 07:35

score 1 · Answer 2 · answered Jan 10 '19 at 04:05

This line from the question will not work:

soup = BeautifulSoup(requests.get(url), "lxml")

BeautifulSoup needs the content of the response, not the Response object itself, so append .text or .content.

To get the expected result for Manufacturer, select the a elements inside td[class="brand"], then add each link's .next_sibling.string to its text:

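import requests
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/A%26W_Root_Beer"  # example page from the question
html = requests.get(url).text
soup = BeautifulSoup(html, "lxml")
# Links inside the infobox cell whose class attribute is exactly "brand"
result = soup.select('td[class="brand"] a')
# Join each link's text with the plain text that directly follows it
manufacturer = [a.text + a.next_sibling.string for a in result]
print(manufacturer)
# ['Keurig Dr Pepper (United States, Worldwide)', 'A&W Canada (Canada)']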
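The list comprehension above assumes every link is followed by a plain text node. If a link is the last thing in the cell, or is followed directly by another tag, a.next_sibling is either None or may have no .string, and the expression fails. A more defensive version could look like the sketch below. SAMPLE_CELL, its /wiki/Example link, and the link_with_trailing_text helper are hypothetical, and the selector td.brand a is a swapped-in CSS class selector that also matches when the cell carries other classes besides brand.

```python
from bs4 import BeautifulSoup, NavigableString

# Hypothetical sample cell; the third link has no text after it.
SAMPLE_CELL = """
<table><tr><td class="brand">
<a href="/wiki/Keurig_Dr_Pepper">Keurig Dr Pepper</a> (United States, Worldwide)<br/>
<a href="/wiki/A%26W_(Canada)">A&amp;W Canada</a> (Canada)<br/>
<a href="/wiki/Example">No trailing text</a></td></tr></table>
"""


def link_with_trailing_text(a):
    """Return the link text plus the text node right after it, if there is one."""
    sibling = a.next_sibling
    # Only plain text counts; a missing sibling or a tag contributes nothing
    trailing = str(sibling) if isinstance(sibling, NavigableString) else ""
    return (a.get_text() + trailing).strip()


soup = BeautifulSoup(SAMPLE_CELL, "html.parser")
manufacturers = [link_with_trailing_text(a) for a in soup.select("td.brand a")]
print(manufacturers)
```

This still returns one entry per link, so a value that is not a link at all would be missed; it only avoids the failure on links without trailing text.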
