Extracting comments from news articles

Question

My question is similar to the one asked here: https://stackoverflow.com/questions/14599485/news-website-comment-analysis

I am trying to extract the comments from any news article. For example, take this news URL: http://www.cnn.com/2013/09/24/politics/un-obama-foreign-policy/

I am trying to use BeautifulSoup in Python to extract the comments. However, the comment section seems to be either embedded in an iframe or loaded through JavaScript. Inspecting the page in Firebug does not reveal the source of the comment section, but explicitly viewing the source of the comments through the browser's view-source feature does. How can I extract the comments, especially when they come from a different URL embedded in the news page?

This is what I have done so far, although it is not much:

    import urllib2
    from bs4 import BeautifulSoup

    # Download the raw HTML of the article page.
    opener = urllib2.build_opener()
    url = 'http://www.cnn.com/2013/08/28/health/stem-cell-brain/index.html'
    urlContent = opener.open(url).read()

    # Parse the page and print its title.
    soup = BeautifulSoup(urlContent, 'html.parser')
    title = soup.title.text
    print title

    # Write the text of the <body> element to a file, dropping non-ASCII characters.
    body = soup.findAll('body')
    with open("brain.txt", "w") as outfile:
        for i in body:
            outfile.write(i.text.encode('ascii', 'ignore') + '\n')

Any help in what I need to do or how to go about it will be much appreciated.

You'll need to try something like Selenium to emulate the browser's javascript capabilities too. — Snakes and Coffee, Sep 25 '13 at 06:20
@SnakesandCoffee You don't need JavaScript for this specific case. It's just an iframe, so you can download the whole page. — Foo Bar User, Sep 28 '13 at 18:58

Foo Bar User · Answer 1 · 2013-09-28T19:00:30.270

It's inside an iframe. Look for a frame with id="dsq2".

The iframe has a src attribute, which is a link to the actual site that hosts the comments.

So in Beautiful Soup, use css_soup.select("#dsq2") and read the URL from the src attribute. It will lead you to a page that contains only the comments.
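As a minimal sketch of that step (written for Python 3, unlike the question's code), the snippet below looks up the iframe by id and reads its src attribute. The sample HTML, the example.com URL, and the helper name find_comment_frame_url are made up for illustration; the real page would be downloaded first, and its markup may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded article HTML.
SAMPLE_ARTICLE_HTML = """
<html><body>
  <div id="article">Article text</div>
  <iframe id="dsq2" src="http://example.com/embed/comments?thread=123"></iframe>
</body></html>
"""


def find_comment_frame_url(html, frame_id="dsq2"):
    """Return the src of the iframe with the given id, or None if it is absent."""
    soup = BeautifulSoup(html, "html.parser")
    frame = soup.select_one("iframe#" + frame_id)
    if frame is None:
        return None
    return frame.get("src")


comment_url = find_comment_frame_url(SAMPLE_ARTICLE_HTML)
print(comment_url)
```

Returning None when the frame is missing makes it easy to detect pages where the id differs or the comments are not embedded this way.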

To get the actual comments, download the page that src points to and use this CSS selector: .post-message p
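The selector can be tried on a small sample like this (Python 3). The HTML below is an invented approximation of a comments page, and extract_comment_paragraphs is a hypothetical helper name; only the .post-message p selector comes from the answer.

```python
from bs4 import BeautifulSoup

# Hypothetical sample of a comments-only page; the real markup may differ.
SAMPLE_COMMENTS_HTML = """
<ul>
  <li><div class="post-message"><p>First comment.</p></div></li>
  <li><div class="post-message"><p>Second comment,</p><p>in two paragraphs.</p></div></li>
</ul>
"""


def extract_comment_paragraphs(html):
    """Return the text of every <p> nested inside an element with class post-message."""
    soup = BeautifulSoup(html, "html.parser")
    return [p.get_text(strip=True) for p in soup.select(".post-message p")]


comments = extract_comment_paragraphs(SAMPLE_COMMENTS_HTML)
for text in comments:
    print(text)
```

Note that the selector matches paragraphs, not whole comments, so a comment with several paragraphs yields several entries.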

If you want to load more comments, clicking the "more comments" button seems to send this request:

http://disqus.com/api/3.0/threads/listPostsThreaded?limit=50&thread=1660715220&forum=cnn&order=popular&cursor=2%3A0%3A0&api_key=E8Uh5l5fHZ6gD8U3KycjAIAk46f68Zw7C6eW8WSjZvCLXebZ7p0r1yrYDrLilk2F
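A request URL of that shape could be assembled as in the sketch below (Python 3, standard library only). The parameter names are read off the observed URL; the helper name build_list_posts_url and the placeholder API key are assumptions, and the answer does not say how the cursor value advances between requests, so it is simply passed in.

```python
from urllib.parse import urlencode

# Endpoint taken from the request URL observed above.
BASE_URL = "http://disqus.com/api/3.0/threads/listPostsThreaded"


def build_list_posts_url(thread, forum, api_key, cursor="2:0:0",
                         limit=50, order="popular"):
    """Build a request URL with the same query parameters as the observed one."""
    params = [
        ("limit", limit),
        ("thread", thread),
        ("forum", forum),
        ("order", order),
        ("cursor", cursor),   # urlencode escapes the colons as %3A
        ("api_key", api_key),
    ]
    return BASE_URL + "?" + urlencode(params)


# PLACEHOLDER_KEY is not a real key; substitute the one the page actually sends.
request_url = build_list_posts_url("1660715220", "cnn", "PLACEHOLDER_KEY")
print(request_url)
```

Whether this endpoint accepts requests made outside the browser, and what the response looks like, is not covered by the answer and would need to be checked.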
