import requests
from lxml import html

page = requests.get('http://www.cnn.com')
html_content = html.fromstring(page.content)

# Print the direct children of the parsed document root
for i in html_content.iterchildren():
    print(i)

# Grab headline text and link targets from <h2 data-analytics="..."> headings
news_stories = html_content.xpath('//h2[@data-analytics]/a/span/text()')
news_links = html_content.xpath('//h2[@data-analytics]/a/@href')

I am trying to run this code to understand how web scraping in Python works.

I want to scrape the top news stories and their links from CNN.

When I run this in the Python shell, the output I get for both news_stories and news_links is:

[]

My question is: where am I going wrong with this, and is there a better way to achieve what I am trying to do?

1 Answer
In your code, html_content is just the parsed root element; printing it shows only the element's address in memory, not the actual content of the page:

html_content = html.fromstring(page.content)
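
If you want to see what the parsed tree actually holds, you can serialize it back to markup with lxml's tostring (a minimal sketch):

import requests
from lxml import html

page = requests.get('http://www.cnn.com')
html_content = html.fromstring(page.content)

# tostring() turns the parsed tree back into markup; printing the
# element itself only shows its repr, e.g. <Element html at 0x...>
print(html.tostring(html_content, pretty_print=True)[:500])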

Alternatively, you can print the raw response to see the complete HTML code for that page:

import requests

page = requests.get('http://www.cnn.com')
# Print the raw, decoded HTML exactly as the server returned it
print(page.text)
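
A likely reason both XPath queries return [] is that those h2[@data-analytics] headings are injected by JavaScript and never appear in the static HTML that requests downloads. A quick way to check (a minimal sketch; 'data-analytics' is just the attribute from your XPath):

import requests

page = requests.get('http://www.cnn.com')

# If this prints 0, the attribute is not in the static HTML at all,
# so no XPath against the parsed tree can ever match it.
print(page.text.count('data-analytics'))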

Even if you do get the content somehow, it may come back from the server as a gzipped response. (Get html using Python requests?)
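
For what it's worth, you can confirm what the server actually sent by inspecting the response headers; requests decompresses a gzipped body transparently, so page.text is already plain markup (a minimal sketch):

import requests

page = requests.get('http://www.cnn.com')

# Content-Encoding shows whether the body was gzipped on the wire;
# requests has already decoded page.text / page.content for you.
print(page.headers.get('Content-Encoding'))
print(page.encoding)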

I would highly recommend using the httplib2 library and BeautifulSoup to scrape news stories from CNN. They are really handy to use and will get you what you want. You can see another Stack Overflow post here (retrieve links from web page using python and BeautifulSoup).
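
As a rough sketch of that approach (I am keeping requests for the fetch since it is already in your code, but httplib2 works the same way; the tag and attribute choices mirror your XPath and are assumptions about CNN's markup, which changes often):

import requests
from bs4 import BeautifulSoup

page = requests.get('http://www.cnn.com')
soup = BeautifulSoup(page.text, 'html.parser')

# Find every <h2> carrying a data-analytics attribute, then pull the
# link target and its visible text, mirroring the XPath in the question.
for h2 in soup.find_all('h2', attrs={'data-analytics': True}):
    a = h2.find('a')
    if a is not None and a.get('href'):
        print(a.get_text(strip=True), a['href'])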

I hope that helps.
