1

I have below html table and want to fetch table data i.e "Revenues ($M) $135,987" which exist in first row of table. How to achieve this using python beautifulsoup.

<table data-reactid=".romjx8c48.1.0.5.1:1.4.0.3.1.0.0.0.0.1.0.0.0.0">
 <thead data-reactid=".romjx8c48.1.0.5.1:1.4.0.3.1.0.0.0.0.1.0.0.0.0.0">
  <tr data-reactid=".romjx8c48.1.0.5.1:1.4.0.3.1.0.0.0.0.1.0.0.0.0.0.0">
   <th data-reactid=".romjx8c48.1.0.5.1:1.4.0.3.1.0.0.0.0.1.0.0.0.0.0.0.0" width="200">
   </th>
   <th data-reactid=".romjx8c48.1.0.5.1:1.4.0.3.1.0.0.0.0.1.0.0.0.0.0.0.1:$th-$ millions">
    $ millions
   </th>
   <th data-reactid=".romjx8c48.1.0.5.1:1.4.0.3.1.0.0.0.0.1.0.0.0.0.0.0.1:$th-% change">
    % change
   </th>
  </tr>
 </thead>
 <tbody data-reactid=".romjx8c48.1.0.5.1:1.4.0.3.1.0.0.0.0.1.0.0.0.0.1">
  <tr data-reactid=".romjx8c48.1.0.5.1:1.4.0.3.1.0.0.0.0.1.0.0.0.0.1.$company-data-Revenues ($M)">
   <td class="title" data-reactid=".romjx8c48.1.0.5.1:1.4.0.3.1.0.0.0.0.1.0.0.0.0.1.$company-data-Revenues ($M).0">
    Revenues ($M)
   </td>
   <td data-reactid=".romjx8c48.1.0.5.1:1.4.0.3.1.0.0.0.0.1.0.0.0.0.1.$company-data-Revenues ($M).1">
    $135,987
   </td>
   <td data-reactid=".romjx8c48.1.0.5.1:1.4.0.3.1.0.0.0.0.1.0.0.0.0.1.$company-data-Revenues ($M).2">
    27.1%
   </td>
  </tr>

Script to extract data from direct source:

import requests
from bs4 import BeautifulSoup as bs

r = requests.get('http://fortune.com/fortune500/amazon-com/')
soup = bs(r.content, 'html.parser')

result = soup.find('div', {'class': 'small-12 columns'})
table = result.find_all('table')[0] # Grab the first table
print(table.find('td', {'data-reactid': '.romjx8c48.1.0.5.1:1.4.0.3.1.0.0.0.0.1.0.0.0.0.1.$company-data-Revenues ($M).1'}).text)
Learnings
  • 2,780
  • 9
  • 35
  • 55

1 Answers1

1

Select the 'data-reactid' with the value '.romjx8c48.1.0.5.1:1.4.0.3.1.0.0.0.0.1.0.0.0.0.1.$company-data-Revenues ($M).1'} and read it's text.

from bs4 import BeautifulSoup

html = """<table data-reactid=".romjx8c48.1.0.5.1:1.4.0.3.1.0.0.0.0.1.0.0.0.0">
     <thead data-reactid=".romjx8c48.1.0.5.1:1.4.0.3.1.0.0.0.0.1.0.0.0.0.0">
      <tr data-reactid=".romjx8c48.1.0.5.1:1.4.0.3.1.0.0.0.0.1.0.0.0.0.0.0">
       <th data-reactid=".romjx8c48.1.0.5.1:1.4.0.3.1.0.0.0.0.1.0.0.0.0.0.0.0" width="200">
       </th>
       <th data-reactid=".romjx8c48.1.0.5.1:1.4.0.3.1.0.0.0.0.1.0.0.0.0.0.0.1:$th-$ millions">
        $ millions
       </th>
       <th data-reactid=".romjx8c48.1.0.5.1:1.4.0.3.1.0.0.0.0.1.0.0.0.0.0.0.1:$th-% change">
        % change
       </th>
      </tr>
     </thead>
     <tbody data-reactid=".romjx8c48.1.0.5.1:1.4.0.3.1.0.0.0.0.1.0.0.0.0.1">
      <tr data-reactid=".romjx8c48.1.0.5.1:1.4.0.3.1.0.0.0.0.1.0.0.0.0.1.$company-data-Revenues ($M)">
       <td class="title" data-reactid=".romjx8c48.1.0.5.1:1.4.0.3.1.0.0.0.0.1.0.0.0.0.1.$company-data-Revenues ($M).0">
        Revenues ($M)
       </td>
       <td data-reactid=".romjx8c48.1.0.5.1:1.4.0.3.1.0.0.0.0.1.0.0.0.0.1.$company-data-Revenues ($M).1">
        $135,987
       </td>
       <td data-reactid=".romjx8c48.1.0.5.1:1.4.0.3.1.0.0.0.0.1.0.0.0.0.1.$company-data-Revenues ($M).2">
        27.1%
       </td>
      </tr>
      <tr data-reactid=".romjx8c48.1.0.5.1:1.4.0.3.1.0.0.0.0.1.0.0.0.0.1.$company-data-Profits ($M)">
       <td class="title" data-reactid=".romjx8c48.1.0.5.1:1.4.0.3.1.0.0.0.0.1.0.0.0.0.1.$company-data-Profits ($M).0">
        Profits ($M)
       </td>
       <td data-reactid=".romjx8c48.1.0.5.1:1.4.0.3.1.0.0.0.0.1.0.0.0.0.1.$company-data-Profits ($M).1">
        $2,371.0
       </td>
       <td data-reactid=".romjx8c48.1.0.5.1:1.4.0.3.1.0.0.0.0.1.0.0.0.0.1.$company-data-Profits ($M).2">
        297.8%
       </td>
      </tr>
      </tbody>
    </table>
    """

soup = BeautifulSoup(html, 'html.parser')
print(soup.find('td', {'data-reactid': '.romjx8c48.1.0.5.1:1.4.0.3.1.0.0.0.0.1.0.0.0.0.1.$company-data-Revenues ($M).1'}).text)

Outputs:

$135,987

Updated in response to comment:

the page is rendered with JavaScript you can use Selenium to render it:

First install Selenium:

sudo pip3 install selenium

Then get a driver https://sites.google.com/a/chromium.org/chromedriver/downloads you can use a headless version of chrome "Chrome Canary" if you are on Windows or Mac.

import bs4 as bs
from selenium import webdriver

browser = webdriver.Chrome()

url = "http://fortune.com/fortune500/amazon-com/"
browser.get(url)
html_source = browser.page_source
browser.quit()
soup = bs.BeautifulSoup(html_source, "html.parser")
# print (soup)
tds = soup.find_all('td')
print(tds[1].text)

Or for other non-selenium methods see my answer to Scraping Google Finance (BeautifulSoup)

Dan-Dev
  • 8,957
  • 3
  • 38
  • 55
  • wow super.... actually I want to fetch all useful information from http://fortune.com/fortune500/amazon-com/, I tried some script, added in query, Please check its giving error "AttributeError: 'NoneType' object has no attribute 'text'" – Learnings Aug 05 '17 at 16:00
  • above code working fine. How to disable browser & driver exe opening while running script, I want to run in background as I want to pass around 10 URL to collect data. – Learnings Aug 06 '17 at 05:55
  • 1
    For running chrome headless (not opening in a window) see https://duo.com/blog/driving-headless-chrome-with-python the instructions are for Mac but I think Windows is similar. – Dan-Dev Aug 06 '17 at 11:35
  • I am facing another issue, I have asked it in another query, could you please help on this: https://stackoverflow.com/questions/45533571/webpage-values-are-missing-while-scraping-data-using-beautifulsoup-python-3-6 – Learnings Aug 06 '17 at 15:34