I am trying to use Python and BeautifulSoup to scrape this page from the sfglobe website: http://sfglobe.com/2015/04/28/stirring-pictures-from-the-riots-in-baltimore. This is my code:

import urllib2
from bs4 import BeautifulSoup 

url = 'http://sfglobe.com/2015/04/28/stirring-pictures-from-the-riots-in-baltimore' 
req = urllib2.urlopen(url) 
html = req.read() 
soup = BeautifulSoup(html) 
desc = soup.find('span', class_='articletext intro')

Could anyone help me to solve this problem?

Michael
fly2sky
  • My code is in the following way: import urllib2 from bs4 import BeautifulSoup url = 'http://sfglobe.com/2015/04/28/stirring-pictures-from-the-riots-in-baltimore' req = urllib2.urlopen(url) html = req.read() soup = BeautifulSoup(html) desc = soup.find('span', class_='articletext intro') – fly2sky Apr 30 '15 at 22:10
  • Edit your question and add the code there, to be more readable. – doru Apr 30 '15 at 22:15

1 Answer

From the question title, I'm assuming that the only thing you want is the description of the article, which can be found in a <meta> tag within the HTML <head>.

You were on the right track, but I'm not exactly sure why you did:

desc = soup.find('span', class_='articletext intro')

Regardless, I came up with something using requests (see http://stackoverflow.com/questions/2018026/should-i-use-urllib-or-urllib2-or-requests) rather than urllib2:

import requests
from bs4 import BeautifulSoup

url = 'http://sfglobe.com/2015/04/28/stirring-pictures-from-the-riots-in-baltimore'
req = requests.get(url)
html = req.text
soup = BeautifulSoup(html)

tag = soup.find(attrs={'name':'description'}) # find meta tag w/ description
desc = tag['value'] # get value of attribute 'value'

print(desc)

If that isn't what you are looking for, please clarify so I can try and help you more.
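
As a slightly more defensive variant of the snippet above, here is a minimal sketch that runs on an inline sample instead of the live page. The sample markup and the helper name `meta_description` are made up for illustration, and the real page may be structured differently. It passes an explicit parser to BeautifulSoup, returns None instead of raising when the tag is missing, and reads the standard `content` attribute first, keeping `value` only as a fallback.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for the real page, which may differ.
SAMPLE_HTML = """
<html><head>
<meta name="description" content="Pictures from the riots in Baltimore.">
</head><body></body></html>
"""

def meta_description(html):
    """Return the meta description of the page, or None if there is none."""
    # Naming the parser avoids depending on whichever one happens to be installed
    soup = BeautifulSoup(html, 'html.parser')
    tag = soup.find('meta', attrs={'name': 'description'})
    if tag is None:
        return None
    # Standard HTML stores the text in 'content'; 'value' is only a fallback
    return tag.get('content') or tag.get('value')

print(meta_description(SAMPLE_HTML))
```

To use it on the live page, pass `requests.get(url).text` in place of `SAMPLE_HTML`.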

EDIT: after some clarification, I pieced together why you were originally using desc = soup.find('span', class_='articletext intro').

Maybe this is what you are looking for:

import requests
from bs4 import BeautifulSoup, NavigableString

url = 'http://sfglobe.com/2015/04/28/stirring-pictures-from-the-riots-in-baltimore'
req = requests.get(url)
html = req.text
soup = BeautifulSoup(html)

body = soup.find('span', class_='articletext intro')

# remove script tags
for s in body('script'):
    s.extract()

text = ""

# iterate through non-script elements in the content body
for stuff in body.select('*'):
    # get the children of the tag; .contents returns a list
    content = stuff.contents
    # keep the element only if its sole child is a NavigableString (plain text), not a tag
    if len(content) == 1 and isinstance(content[0], NavigableString):
        text += content[0]

print(text)
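
If `body` comes back as None, one possible cause (an assumption, I can't confirm it for this page) is how the class is matched: `class_='articletext intro'` only matches when the class attribute is exactly that string, in that order, while the CSS selector `span.articletext.intro` matches both classes in any order. The sketch below uses made-up sample markup and a hypothetical helper name, `article_text`, to show that selector together with a None check. It also swaps the loop above for `get_text()`, which collects text at every nesting level.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup; note the classes are in the opposite order.
SAMPLE_HTML = """
<div>
  <span class="intro articletext">First paragraph. <script>var x = 1;</script><b>Second</b> paragraph.</span>
</div>
"""

def article_text(html):
    """Return the text of the article span, or None if the span is missing."""
    soup = BeautifulSoup(html, 'html.parser')
    # The CSS selector matches both classes regardless of their order
    body = soup.select_one('span.articletext.intro')
    if body is None:
        return None
    # Drop script tags so their source does not end up in the text
    for script in body.find_all('script'):
        script.decompose()
    # Collect all nested text, then collapse runs of whitespace
    return ' '.join(body.get_text(' ', strip=True).split())

print(article_text(SAMPLE_HTML))
```

If the helper still returns None on the live page, checking `req.status_code` and searching `req.text` for the string `articletext` would show whether the server is returning different HTML than the browser displays.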
Bryce
  • Thanks a lot for your help. I tried your method, but it only returns the first paragraph of the description. How can I get the full description? Thanks a lot. – fly2sky May 01 '15 at 05:05
  • I think I get what you were trying to do with `desc = soup.find('span', class_='articletext intro')` now. Updating my answer. – Bryce May 01 '15 at 21:42
  • Hi, Bryce: Thank you so much for your help. However, for your method, the body is 'Nonetype' on my side (body = soup.find('span', class_='articletext intro')). Can you get some results? Thanks a lot. – fly2sky May 01 '15 at 22:33
  • I get the proper output, which is a paragraph consisting of the text of the article body. Do you have `requests` installed? Can you successfully open that link in your browser? – Bryce May 01 '15 at 23:02
  • All right, I think my problem is that "body = soup.find('span', class_='articletext intro')" returns None when I run it on my side. I have installed the requests lib, and I can open the link successfully. It is so weird. – fly2sky May 03 '15 at 20:38
  • Can you make sure that you are not missing any lines of code, and if that doesn't work, you can share your code here so that I can take a look. – Bryce May 04 '15 at 16:12