I'm new to BeautifulSoup and am trying to extract the
company name, rank, and revenue from this Wikipedia page:
https://en.m.wikipedia.org/wiki/List_of_largest_Internet_companies
The code I've used so far is:
from bs4 import BeautifulSoup
import requests
url = "https://en.wikiepdia.org"
req = requests.get(url)
bsObj = BeautifulSoup(req.text, "html.parser")
data = bsObj.find('table',{'class':'wikitable sortable mw-collapsible'})
revenue=data.findAll('data-sort-value')
I realise that even 'data' is not populated correctly: it holds no values when I pass it to the Flask website.
Could someone please suggest a fix and the most elegant way to achieve the above, along with some guidance on what to look for in the HTML when scraping (and in what format)?
On this page, https://en.m.wikipedia.org/wiki/List_of_largest_Internet_companies, I am not sure which element I am meant to target for extraction: the table class, a div class, or the body class. I would also like to know how to extract the link and revenue further down the tree.
I've also tried:
data = bsObj.find_all('table', class_='wikitable sortable mw-collapsible')
The server runs with no errors, but only an empty list, "[]", is displayed on the webpage.
Based on one of the answers below, I updated the code to the following:
url = "https://en.wikiepdia.org"
req = requests.get(url)
bsObj = BeautifulSoup(req.text, "html.parser")
mydata=bsObj.find('table',{'class':'wikitable sortable mw-collapsible'})
table_data=[]
rows = mydata.findAll(name=None, attrs={}, recursive=True, text=None, limit=None, kwargs='')('tr')
for row in rows:
    cols=row.findAll('td')
    row_data=[ele.text.strip() for ele in cols]
    table_data.append(row_data)
data=table_data[0:10]
The persistent error is:
  File "webscraper.py", line 15, in <module>
    rows = mydata.findAll(name=None, attrs={}, recursive=True, text=None, limit=None, kwargs='')('tr')
AttributeError: 'NoneType' object has no attribute 'findAll'
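For context on the traceback above: `find` returns `None` when no element matches, so any method called on the result fails in exactly this way. Below is a minimal sketch of a guard, run against a made-up inline HTML snippet rather than the live page (the markup, the `table_rows` helper, and the variable names are all assumptions for illustration). It also shows `select_one`, which matches on individual classes rather than on the exact class string.

```python
from bs4 import BeautifulSoup

# Made-up stand-in for a downloaded page; the real page's markup may differ.
html = """
<table class="wikitable sortable">
  <tr><th>Rank</th><th>Company</th></tr>
  <tr><td>1</td><td>Amazon</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# Exact class string from the question. No table in this snippet carries
# that exact string, so find() returns None.
missing = soup.find("table", {"class": "wikitable sortable mw-collapsible"})

# A CSS selector matches each class separately, whatever other classes exist.
table = soup.select_one("table.wikitable.sortable")


def table_rows(tbl):
    """Hypothetical helper: return the <tr> elements, or [] if tbl is None."""
    if tbl is None:
        return []
    return tbl.find_all("tr")


print(len(table_rows(missing)))  # 0
print(len(table_rows(table)))    # 2
```

Checking for `None` before calling `find_all` turns the crash into an empty result that is easier to diagnose, for example by printing the fetched URL or the response status.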
Based on the answer below, it is now scraping the data, but not in the format asked for above.
I've got this:
url = 'https://en.m.wikipedia.org/wiki/List_of_largest_Internet_companies'
req = requests.get(url)
bsObj = BeautifulSoup(req.text, 'html.parser')
data = bsObj.find('table',{'class':'wikitable sortable mw-collapsible'})
table_data = []
rows = data.find_all('tr')
for row in rows:
    cols = row.find_all('td')
    row_data = [ele.text.strip() for ele in cols]
    table_data.append(row_data)
# First element is the header row, which is why it is empty
data=table_data[0:5]
for i in range(5):
    rank=data[i]
    name=data[i+1]
For completeness (and a full answer), I'd like it to display:
- The first five companies in the table
- The company name, the rank, and the revenue
Currently it displays this:
Wikipedia
[[], ['1', 'Amazon', '$280.5', '2019', '798,000', '$920.22', 'Seattle', '1994', '[1][2]'], ['2', 'Google', '$161.8', '2019', '118,899', '$921.14', 'Mountain View', '1998', '[3][4]'], ['3', 'JD.com', '$82.8', '2019', '220,000', '$51.51', 'Beijing', '1998', '[5][6]'], ['4', 'Facebook', '$70.69', '2019', '45,000', '$585.37', 'Menlo Park', '2004', '[7][8]']]
['1', 'Amazon', '$280.5', '2019', '798,000', '$920.22', 'Seattle', '1994', '[1][2]']
['2', 'Google', '$161.8', '2019', '118,899', '$921.14', 'Mountain View', '1998', '[3][4]']
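One possible way to get from the list shown above to the desired format is to pick columns by position. This is only a sketch: the rows are copied from the output above, and the column positions (0 for rank, 1 for name, 2 for revenue) are an assumption read off that output, not something confirmed against the page. The `companies` name is hypothetical.

```python
# Rows copied from the output shown above. The leading [] is the header row,
# which yields no <td> cells.
table_data = [
    [],
    ['1', 'Amazon', '$280.5', '2019', '798,000', '$920.22', 'Seattle', '1994', '[1][2]'],
    ['2', 'Google', '$161.8', '2019', '118,899', '$921.14', 'Mountain View', '1998', '[3][4]'],
    ['3', 'JD.com', '$82.8', '2019', '220,000', '$51.51', 'Beijing', '1998', '[5][6]'],
    ['4', 'Facebook', '$70.69', '2019', '45,000', '$585.37', 'Menlo Park', '2004', '[7][8]'],
]

# Assumed column positions: 0 = rank, 1 = company name, 2 = revenue.
companies = [
    {"rank": row[0], "name": row[1], "revenue": row[2]}
    for row in table_data
    if len(row) >= 3  # skip the empty header row
][:5]  # keep at most the first five companies

for company in companies:
    print(company["rank"], company["name"], company["revenue"])
```

Filtering out empty rows before slicing means the header row no longer takes up one of the five slots. With only the four rows visible in the output above, this prints four lines; against the full table it would stop at five.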