
I'm a beginner with BeautifulSoup, and I'm trying to extract the Company Name, Rank, and Revenue from this Wikipedia page:

https://en.m.wikipedia.org/wiki/List_of_largest_Internet_companies

The code I've used so far is:

from bs4 import BeautifulSoup 
import requests 
url = "https://en.wikiepdia.org" 
req = requests.get(url) 
bsObj = BeautifulSoup(req.text, "html.parser") 
data = bsObj.find('table',{'class':'wikitable sortable mw-collapsible'})
revenue=data.findAll('data-sort-value')

I realise that even `data` is not working correctly, as it holds no values when I pass it to the Flask website.

Could someone please suggest a fix, the most elegant way to achieve the above, and some advice on what to look for in the HTML when scraping (and what format to use)?

On this page, https://en.m.wikipedia.org/wiki/List_of_largest_Internet_companies, I am not sure what I am meant to target for the extraction - the table class, the div class, or the body class - and how to go about extracting the link and the revenue further down the tree.

I've also tried:

data = bsObj.find_all('table', class_='wikitable sortable mw-collapsible')

It runs the server with no errors; however, only an empty list ("[]") is displayed on the webpage.

Based on one answer below, I updated the code to this:

url = "https://en.wikiepdia.org" 
req = requests.get(url) 
bsObj = BeautifulSoup(req.text, "html.parser") 
mydata=bsObj.find('table',{'class':'wikitable sortable mw-collapsible'})
table_data=[]
rows = mydata.findAll(name=None, attrs={}, recursive=True, text=None, limit=None, kwargs='')('tr')
for row in rows:
    cols=row.findAll('td')
    row_data=[ele.text.strip() for ele in cols]
    table_data.append(row_data)

data=table_data[0:10]

The persistent error is:

 File "webscraper.py", line 15, in <module>
    rows = mydata.findAll(name=None, attrs={}, recursive=True, text=None, limit=None, kwargs='')('tr')
AttributeError: 'NoneType' object has no attribute 'findAll'

Based on the answer below, it is now scraping the data, but not in the format asked for above.

I've got this:

url = 'https://en.m.wikipedia.org/wiki/List_of_largest_Internet_companies' 
req = requests.get(url) 
bsObj = BeautifulSoup(req.text, 'html.parser')
data = bsObj.find('table',{'class':'wikitable sortable mw-collapsible'})

table_data = []
rows = data.find_all('tr')
for row in rows:
    cols = row.find_all('td')
    row_data = [ele.text.strip() for ele in cols]
    table_data.append(row_data)

# First element is header so that is why it is empty
data=table_data[0:5]

for in in range(5):
    rank=data[i]
    name=data[i+1]

For completeness (and a full answer) I'd like it to display:

- The first five companies in the table
- The company name, the rank, and the revenue

Currently it displays this:

Wikipedia

[[], ['1', 'Amazon', '$280.5', '2019', '798,000', '$920.22', 'Seattle', '1994', '[1][2]'], ['2', 'Google', '$161.8', '2019', '118,899', '$921.14', 'Mountain View', '1998', '[3][4]'], ['3', 'JD.com', '$82.8', '2019', '220,000', '$51.51', 'Beijing', '1998', '[5][6]'], ['4', 'Facebook', '$70.69', '2019', '45,000', '$585.37', 'Menlo Park', '2004', '[7][8]']]

['1', 'Amazon', '$280.5', '2019', '798,000', '$920.22', 'Seattle', '1994', '[1][2]']

['2', 'Google', '$161.8', '2019', '118,899', '$921.14', 'Mountain View', '1998', '[3][4]']

Compoot
  • The URL you are scraping is the Wikipedia homepage. This is the `url = "https://en.wikiepdia.org"` part of your code. There is no table on that page so BeautifulSoup is giving you nothing back to index. That is why you're getting an error. You need to replace that URL with one with a table like the one you reference https://en.m.wikipedia.org/wiki/List_of_largest_Internet_companies – Eric Leung Jul 03 '20 at 15:21
  • Ah, thank you...but still the findall error? – Compoot Jul 03 '20 at 15:27
  • It should be `.find_all()` not `.findAll()` – Eric Leung Jul 03 '20 at 15:30

3 Answers


Usually (not always) when dealing with Wikipedia tables, you don't have to bother with beautifulsoup. Just use pandas:

import pandas as pd

# read_html returns a list of DataFrames, one for each <table> on the page
table = pd.read_html('https://en.m.wikipedia.org/wiki/List_of_largest_Internet_companies')
table[0]  # the companies table is the first one

Output:

    Rank    Company     Revenue ($B)    F.Y.    Employees   Market cap. ($B)    Headquarters    Founded     Refs
0   1   Amazon  $280.5  2019    798000  $920.22     Seattle     1994    [1][2]
1   2   Google  $161.8  2019    118899  $921.14     Mountain View   1998    [3][4]

etc. You can then select or drop columns using standard pandas methods.
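For instance, dropping columns you don't need might look like this (a minimal sketch; the column names are taken from the output above):

# Drop the fiscal-year and references columns (returns a new frame)
table[0].drop(columns=['F.Y.', 'Refs'])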

Edit: To show only the name, rank and revenue of the top 5 companies:

table[0][["Rank", "Company","Revenue ($B)"]].head(5)

Output:

    Rank Company    Revenue ($B)
0   1   Amazon      $280.5
1   2   Google      $161.8
2   3   JD.com     $82.8
3   4   Facebook    $70.69
4   5   Alibaba     $56.152
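
Since the question mentions displaying the result on a Flask page, the same selection can also be turned into an HTML string for a template to render; a minimal sketch (html_table is just an illustrative name):

# Convert the selected columns to an HTML table string, e.g. to pass to a Flask template
html_table = table[0][["Rank", "Company", "Revenue ($B)"]].head(5).to_html(index=False)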
Jack Fleeting
  • This is useful. Does it involve downloading pandas using pip as well, or just the import as you've shown? Unfortunately, I need to be able to do this using BeautifulSoup for learning/teaching purposes. Could you add to your answer a fix using BeautifulSoup, just fixing the existing code? – Compoot Jul 03 '20 at 14:55
  • @MissComputing Yes, unfortunately it requires `pip install pandas`. Let me see how to do it with bs. – Jack Fleeting Jul 03 '20 at 14:57
  • Also, using pandas. could you write your code for the specific question e.g. Showing ALL results for Company Name, Rank and Revenue. – Compoot Jul 03 '20 at 14:59
  • @Eric Leung beat me to it! – Jack Fleeting Jul 03 '20 at 15:07
  • That's not working for me, is also not in the format requested? :) – Compoot Jul 03 '20 at 15:10
  • Changed it to "findAll" and still the same: AttributeError: 'NoneType' object has no attribute 'findAll' – Compoot Jul 03 '20 at 15:11
  • Sorry - wish I could accept two answers. I used your one as more elegant but had already got it working with Eric's ...thank you! – Compoot Jul 03 '20 at 15:48
  • @MissComputing That's fine! Eric's was first - I upvoted his too. – Jack Fleeting Jul 03 '20 at 15:51
  • This has led to another question (trying desperately to get this last step done for tomorrow)! https://stackoverflow.com/questions/62719063/beautiful-soup-python-trying-to-display-scraped-contents-of-a-for-loop-on-an-h <- here in case you're interested :) – Compoot Jul 03 '20 at 16:06

Here's an example using BeautifulSoup. A lot of the following is based on the answer here https://stackoverflow.com/a/23377804/6873133.

from bs4 import BeautifulSoup 
import requests

url = 'https://en.m.wikipedia.org/wiki/List_of_largest_Internet_companies' 
req = requests.get(url) 

bsObj = BeautifulSoup(req.text, 'html.parser')
data = bsObj.find('table',{'class':'wikitable sortable mw-collapsible'})

table_data = []
rows = data.find_all('tr')
for row in rows:
    cols = row.find_all('td')
    row_data = [ele.text.strip() for ele in cols]
    table_data.append(row_data)

# First element is header so that is why it is empty
table_data[0:5]
# [[],
#  ['1', 'Amazon', '$280.5', '2019', '798,000', '$920.22', 'Seattle', '1994', '[1][2]'],
#  ['2', 'Google', '$161.8', '2019', '118,899', '$921.14', 'Mountain View', '1998', '[3][4]'],
#  ['3', 'JD.com', '$82.8', '2019', '220,000', '$51.51', 'Beijing', '1998', '[5][6]'],
#  ['4', 'Facebook', '$70.69', '2019', '45,000', '$585.37', 'Menlo Park', '2004', '[7][8]']]

To isolate certain elements of this list, you just need to be mindful of the numerical index of the inner list. Here, let's look at the first few values for Amazon.

# The entire row for Amazon
table_data[1]
# ['1', 'Amazon', '$280.5', '2019', '798,000', '$920.22', 'Seattle', '1994', '[1][2]']

# Rank
table_data[1][0]
# '1'

# Company
table_data[1][1]
# 'Amazon'

# Revenue
table_data[1][2]
# '$280.5'

So to isolate just the first three columns (rank, company, and revenue), you can run the following list comprehension.

iso_data = [tab[0:3] for tab in table_data]

iso_data[1:6]
# [['1', 'Amazon', '$280.5'], ['2', 'Google', '$161.8'], ['3', 'JD.com', '$82.8'], ['4', 'Facebook', '$70.69'], ['5', 'Alibaba', '$56.152']]

Then, if you want to put it into a pandas data frame, you can do the following.

import pandas as pd

# Start from index 1 to skip the empty first row (the header row has no <td> cells)
df = pd.DataFrame(table_data[1:], columns = ['Rank', 'Company', 'Revenue', 'F.Y.', 'Employees', 'Market cap', 'Headquarters', 'Founded', 'Refs'])

df
#    Rank     Company  Revenue  F.Y. Employees Market cap   Headquarters Founded        Refs
# 0     1      Amazon   $280.5  2019   798,000    $920.22        Seattle    1994      [1][2]
# 1     2      Google   $161.8  2019   118,899    $921.14  Mountain View    1998      [3][4]
# 2     3      JD.com    $82.8  2019   220,000     $51.51        Beijing    1998      [5][6]
# 3     4    Facebook   $70.69  2019    45,000    $585.37     Menlo Park    2004      [7][8]
# 4     5     Alibaba  $56.152  2019   101,958    $570.95       Hangzhou    1999     [9][10]
# ..  ...         ...      ...   ...       ...        ...            ...     ...         ...
# 75   77    Farfetch    $1.02  2019     4,532      $3.51         London    2007  [138][139]
# 76   78        Yelp    $1.01  2019     5,950      $2.48  San Francisco    1996  [140][141]
# 77   79   Vroom.com     $1.1  2020     3,990       $5.2  New York City    2003       [142]
# 78   80  Craigslist     $1.0  2018     1,000          -  San Francisco    1995       [143]
# 79   81    DocuSign     $1.0  2018     3,990     $10.62  San Francisco    2003       [144]
# 
# [80 rows x 9 columns]
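
From there, the format the question asks for (the first five companies with just rank, name and revenue) could be pulled out along these lines, building on the df above:

# Keep only the three columns of interest and the first five companies
df[['Rank', 'Company', 'Revenue']].head(5)

This gives the same five companies shown in the pandas answer above.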
Eric Leung
  • Thank you - so I can try it, as an answer, could it be extracting exactly this: Company Name, Rank, Revenue (first 10 records) – Compoot Jul 03 '20 at 15:05
  • Also this line is coming up with an error "rows = mydata.find_all('tr')" > AttributeError: 'NoneType' object has no attribute 'find_all' – Compoot Jul 03 '20 at 15:09
  • Fixed that! For consistency and helping others that will refer to this (and my students will) could you update answer to show the correct indexes, e.g. isolating NAME, RANK and REVENUE. Will accept then...thanks a million – Compoot Jul 03 '20 at 15:31
  • Glad that helped. What do you mean by "isolating"? What should the index be? – Eric Leung Jul 03 '20 at 15:33
  • data=table_data[0:5] for in in range(5): rank=data[i] name=data[i+1] .. I mean, the only data I want to have is the company name, the revenue and the rank. (to remove all other data when printed). AND the first ten records. I've tried this loop- – Compoot Jul 03 '20 at 15:35
  • For completeness - have updated the answer - see last part. It's just the order that is missing – Compoot Jul 03 '20 at 15:39
  • Sorry to make you work for this answer! I'll be so grateful :) – Compoot Jul 03 '20 at 15:40
  • Let me know if that works out. I've updated my code to isolate just those three columns you want into a list. – Eric Leung Jul 03 '20 at 15:42
  • This has led to another question (trying desperately to get this last step done for tomorrow)! https://stackoverflow.com/questions/62719063/beautiful-soup-python-trying-to-display-scraped-contents-of-a-for-loop-on-an-h <- here in case you're interested :) – Compoot Jul 03 '20 at 16:06

Here's another one, this time with only BeautifulSoup, which prints the top 5 companies' rank, name and revenue (the requests/BeautifulSoup setup is the same as in the answer above):

from bs4 import BeautifulSoup
import requests

url = 'https://en.m.wikipedia.org/wiki/List_of_largest_Internet_companies'
soup = BeautifulSoup(requests.get(url).text, 'html.parser')

table_data = []
trs = soup.select('table tr')      # the companies table is the first table on the page
for tr in trs[1:6]:                # skip the header row, take the next five rows
    row = []
    for t in tr.select('td')[:3]:  # first three cells: rank, company, revenue
        row.append(t.text.strip())
    table_data.append(row)
table_data

Output:

[['1', 'Amazon', '$280.5'],
 ['2', 'Google', '$161.8'],
 ['3', 'JD.com', '$82.8'],
 ['4', 'Facebook', '$70.69'],
 ['5', 'Alibaba', '$56.152']]
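
If you want the first ten companies instead (as asked in the comments), the same loop works with trs[1:11] in place of trs[1:6].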
Jack Fleeting