I'm trying to scrape the Wikipedia infobox and get information for some keywords. For example: https://en.wikipedia.org/wiki/A%26W_Root_Beer
Let's say I'm looking for the values for Manufacturer. I want them in a list, and I only want their text. So in this case the desired output would be ['Keurig Dr Pepper (United States, Worldwide)', 'A&W Canada (Canada)'].
Whatever I try I can't successfully generate this list. Here is a piece of my code:
import requests
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/ABC_Studios"
# BeautifulSoup needs the HTML text, not the Response object
soup = BeautifulSoup(requests.get(url).text, "lxml")
tbl = soup.find("table", {"class": "infobox vcard"})
list_of_table_rows = tbl.find_all("tr")
for tr in list_of_table_rows:
    th = tr.find("th")
    td = tr.find("td")
    # take th.text and td.text
I would like a method that works in various cases: when there are line breaks in the way, when some of the values are links, when some of the values are paragraphs, etc. In all cases, I only want the text that is visible on the screen: not the link, not the paragraph markup, just plain text. I also don't want the output to be Keurig Dr Pepper (United States, Worldwide)A&W Canada (Canada), because later on I would like to be able to parse the result and do something with each entity.
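To make the requirement concrete, here is a minimal sketch of the behaviour described above. It runs on a small hand-written HTML fragment rather than a live page, so the markup in SAMPLE_HTML is an assumption and not the real Wikipedia source. The helper names cell_values and infobox_to_dict are hypothetical. The idea is to turn <br> tags and block elements (li, p, div) into newlines before reading the text, so that a link and the plain text next to it stay on the same line while separate entries end up on separate lines.

```python
from bs4 import BeautifulSoup

# Hand-written fragment for illustration only (assumed, simplified markup).
SAMPLE_HTML = """
<table class="infobox vcard">
  <tr><th>Type</th><td><a href="/wiki/Root_beer">Root beer</a></td></tr>
  <tr><th>Manufacturer</th><td><a href="/wiki/Keurig_Dr_Pepper">Keurig Dr Pepper</a> (United States, Worldwide)<br/><a href="/wiki/A%26W_(Canada)">A&amp;W Canada</a> (Canada)</td></tr>
</table>
"""


def cell_values(td):
    """Return the visible text of a cell as a list, one item per visual line."""
    # Drop footnote markers such as [1], if any are present.
    for sup in td.find_all("sup", class_="reference"):
        sup.decompose()
    # A <br> is a visual line break, so replace it with a newline character.
    for br in td.find_all("br"):
        br.replace_with("\n")
    # Block-level children start and end their own line.
    for block in td.find_all(["li", "p", "div"]):
        block.insert(0, "\n")
        block.append("\n")
    # Links and other inline tags simply contribute their text.
    lines = td.get_text().splitlines()
    # Collapse runs of whitespace and skip empty lines.
    return [" ".join(line.split()) for line in lines if line.strip()]


def infobox_to_dict(html):
    """Map each infobox row label (th) to the list of values in its td."""
    soup = BeautifulSoup(html, "html.parser")
    # class_="infobox" also matches tables whose class is "infobox vcard".
    table = soup.find("table", class_="infobox")
    result = {}
    if table is None:
        return result
    for tr in table.find_all("tr"):
        th, td = tr.find("th"), tr.find("td")
        if th is not None and td is not None:
            result[th.get_text(" ", strip=True)] = cell_values(td)
    return result


info = infobox_to_dict(SAMPLE_HTML)
print(info["Manufacturer"])
# ['Keurig Dr Pepper (United States, Worldwide)', 'A&W Canada (Canada)']
```

For a live page, the same function could be given requests.get(url).text instead of SAMPLE_HTML. Real infoboxes vary a lot (nested tables, hidden spans, rows without a th), so this is a starting point under the stated assumptions and may need extra cases.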
There are many Wikipedia pages that I'm going through and I can't find a method that works for a good portion of them. Could you help me with working code? I'm not proficient in scraping.