I want to extract the paragraph text from news web pages. I have a set of URLs, and from each one I want only the content of the paragraphs. When I use the code below, it gives me the whole HTML page.
Here is my code:

import urllib2
import urllib
from cookielib import CookieJar
from bs4 import BeautifulSoup
cj = CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
p = opener.open("http://www.nytimes.com/2014/09/09/world/europe/turkey-is-courted-by-us-to-help-fight-isis.html?module=Search&mabReward=relbias%3Aw%2C%7B%222%22%3A%22RI%3A18%22%7D&_r=0")
print p.read()
soup = BeautifulSoup(p)
content = soup.find('p', attrs= {'class' : 'story-body-text story-content'})
print content

1 Answer

This happens because of the print p.read() line, which prints out the whole HTML page.
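
Note that read() also drains the response, so a later BeautifulSoup(p) would receive an empty document. Here is a minimal sketch of that behavior. It uses io.BytesIO as a stand-in for the HTTP response, which is an assumption for illustration and not the real urllib2 object, although sequential reads on a file-like response should behave the same way:

```python
import io

# Stand-in for the HTTP response (hypothetical content, not the real page).
fake_response = io.BytesIO(b"<html><body><p>Hello</p></body></html>")

first_read = fake_response.read()   # returns the whole document
second_read = fake_response.read()  # the stream is already at its end

print(len(first_read) > 0)   # the first read has data
print(second_read == b"")    # the second read is empty
```

If the raw HTML is needed for debugging, one option is to store the result of read() in a variable and pass that string to BeautifulSoup instead of the response object.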

To get the article text, find the article by its id and then select all the paragraphs inside it.

Example using a CSS selector:

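# Parse the response (drop the earlier print p.read() so the stream is still unread)
soup = BeautifulSoup(p)
# Join the text of every matching paragraph inside the article
print ''.join(para.text for para in soup.select('article#story p.story-content'))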

Prints:

ANKARA, Turkey —  The Obama administration on Monday began the work of trying to determine
...

FYI, article#story p.story-content matches every p tag that has the story-content class and sits inside the article tag whose id is story.
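
To see what the selector keeps and what it skips, here is a small self-contained sketch. The markup is hypothetical and only mimics the structure described above (it is not the real page), the helper name extract_paragraphs is made up for illustration, and the code uses the parenthesized print form and an explicit "html.parser" argument so that it also runs on Python 3:

```python
from bs4 import BeautifulSoup

# Hypothetical markup that mimics the structure described above;
# it is not the real NYTimes page.
SAMPLE_HTML = """
<html><body>
  <p class="story-content">Teaser outside the article.</p>
  <article id="story">
    <p class="story-body-text story-content">First paragraph.</p>
    <p class="byline">By A. Reporter</p>
    <p class="story-body-text story-content">Second paragraph.</p>
  </article>
</body></html>
"""


def extract_paragraphs(html):
    """Return the text of every p.story-content inside article#story."""
    soup = BeautifulSoup(html, "html.parser")
    return [tag.get_text() for tag in soup.select("article#story p.story-content")]


paragraphs = extract_paragraphs(SAMPLE_HTML)
print("\n".join(paragraphs))
```

The teaser is skipped because it is outside the article, and the byline is skipped because it lacks the story-content class. A p tag with several classes still matches as long as story-content is one of them.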

alecxe