
I've been trying to pull some data from a website whose HTML is nested several levels deep. From all the examples I've seen, BeautifulSoup seems to be a great tool when the data you're after isn't nested far down the tree.

For my little project, I'm trying to have BeautifulSoup pull data from the following location.

Any help would be greatly appreciated.

<html lang="en">
<body>
<div id="wrapper">
<div id="app_timeline">
<div id="timeline-summary">
<div id="timeline-summary-sticky">
<div class="summary-list">
<div>
<div class="summary-type">
<div class="details">
<div class="value">
<div>
<span class="number">100</span>

The number 100 changes daily, so I'd like to write some Python code that pulls the current value each time I run it.

TIA

Andy
  • Using your method it returns 'None', which makes no sense unless it's not going far enough down the HTML? – Andy Feb 07 '19 at 18:22
  • Possible duplicate of [Using Python's BeautifiulSoup Library to Parse info in a Span HTML tag](https://stackoverflow.com/questions/51238622/using-pythons-beautifiulsoup-library-to-parse-info-in-a-span-html-tag) – Michael Joy Feb 07 '19 at 20:41
  • BeautifulSoup is a good package and there are a number of Q&As about it on Stack Overflow. It has good support for parsing nested HTML structures. Give some more details on what you have tried and what errors you encountered. Try to work out how/why similar Q&As will or will not help you and post your findings: https://stackoverflow.com/questions/1501690/parsing-out-data-using-beautifulsoup-in-python , https://stackoverflow.com/questions/1936466/beautifulsoup-grab-visible-webpage-text/ , https://stackoverflow.com/questions/10524387/beautifulsoup-get-text-does-not-strip-all-tags-and-javascript/ – Paul Feb 08 '19 at 02:59

2 Answers


I would use Selenium; I haven't used BeautifulSoup in a while, and I find Selenium easier for extracting data. You can find elements in many ways, one being by class name.

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

chromedriver = 'location of driver'
driver = webdriver.Chrome(service=Service(chromedriver))  # Selenium 4 style
driver.get('url')
# Text of the first element whose class is "number"
data = driver.find_element(By.CLASS_NAME, 'number').text
# List of every element whose class is "number"
all_data = driver.find_elements(By.CLASS_NAME, 'number')
driver.quit()
Hugo Iriarte

For this task, you would want to use the soup.find() method. soup.find() can help you navigate to a specific HTML tag, e.g. <span> or <div>. Calling .text on the result gives you the text between the <span> </span> tags. So, in your case, you would want to try:

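# Python 3: urllib2 was split into urllib.request and urllib.error
import urllib.request
from bs4 import BeautifulSoup

url = "your_url"
# The cookie-aware opener keeps any cookies the site sets during the request
response = urllib.request.build_opener(urllib.request.HTTPCookieProcessor).open(url)
html_doc = response.read()

soup = BeautifulSoup(html_doc, 'html.parser')

# find() returns the first <span> inside <body>; .text is its inner text
print(soup.body.find("span").text)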

Output: 100

If you want to be able to store this value and use it later, assign soup.body.find("span").text to a variable. Try looking at this link to get familiar with BeautifulSoup.
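
If the real page has other <span> tags before the one you want, taking the first <span> may grab the wrong element. A minimal sketch of one alternative is a CSS descendant selector via select_one(), which lets you name just a few landmarks on the way down instead of every <div>. The markup below is the snippet from the question with closing tags added as an assumption, since the real page isn't shown.

```python
from bs4 import BeautifulSoup

# Markup from the question; the closing tags are assumed, not taken from the real page
html_doc = """
<html lang="en"><body>
<div id="wrapper"><div id="app_timeline"><div id="timeline-summary"><div id="timeline-summary-sticky">
<div class="summary-list"><div><div class="summary-type"><div class="details"><div class="value"><div>
<span class="number">100</span>
</div></div></div></div></div></div>
</div></div></div></div>
</body></html>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# A space in a CSS selector means "any descendant", so the
# intermediate <div> levels do not need to be spelled out
span = soup.select_one("#timeline-summary-sticky .value span.number")
value = span.get_text(strip=True)
print(value)
```

How deep the tag sits doesn't matter to the selector; it only needs enough of the path to be unambiguous on the real page.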

  • Surely that is just going to pull the data from within the script itself? I need it to fetch the data from the target site, as the data changes daily. – Andy Feb 07 '19 at 18:25
  • @Andy I added the lines of code for you to input your URL. I wrote the initial script based on the HTML that you provided. – Michael Joy Feb 07 '19 at 18:42
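
On the 'None' result mentioned in the comments on the question: find() returns None whenever no matching tag exists in the HTML that was actually downloaded. One possible cause, and this is only a guess since the target site isn't shown, is that the page fills in the number with JavaScript after loading, so the raw HTML never contains the <span>. A small sketch that makes that case explicit instead of failing on .text; extract_number is a hypothetical helper name, and fetching the page is left to whichever method you use.

```python
from typing import Optional

from bs4 import BeautifulSoup


def extract_number(html: str) -> Optional[int]:
    """Return the integer inside <span class="number">, or None if the span is absent."""
    soup = BeautifulSoup(html, "html.parser")
    span = soup.find("span", class_="number")
    if span is None:
        # The tag is not in this HTML at all (for example, it is added later by JavaScript)
        return None
    return int(span.get_text(strip=True))


# The span is present in the downloaded HTML
present = extract_number('<div class="value"><div><span class="number">100</span></div></div>')
# The container exists but the span was never rendered into the HTML
missing = extract_number('<div id="app_timeline"></div>')
print(present, missing)
```

If the helper keeps returning None for the live page, printing the downloaded HTML and searching it for "number" should show whether the value is there at all.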