
I am doing a Python exercise that requires me to scrape the top news from the Google News website and print it to the console. I used the Beautiful Soup library to retrieve the news. This was my code:

from bs4 import BeautifulSoup
import urllib.request

news_url = "https://news.google.com/news/rss"
URLObject = urllib.request.urlopen(news_url)
xml_page = URLObject.read()
URLObject.close()

soup_page = BeautifulSoup(xml_page, "html.parser")
news_list = soup_page.findAll("item")

for news in news_list:
    print(news.title.text)
    print(news.link.text)
    print(news.pubDate.text)
    print("-" * 60)

But it kept giving me errors by not printing the 'link' and 'pubDate'. After some research, I saw some answers here on Stack Overflow saying that, since the website uses JavaScript, I should use the Selenium package in addition to Beautiful Soup. Despite not really understanding how Selenium works, I updated the code as follows:

from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome("C:/Users/mauricio/Downloads/chromedriver")
driver.maximize_window()
driver.get("https://news.google.com/news/rss")
content = driver.page_source.encode("utf-8").strip()
soup = BeautifulSoup(content, "html.parser")
news_list = soup.findAll("item")

print(news_list)

for news in news_list:
    print(news.title.text)
    print(news.link.text)
    print(news.pubDate.text)
    print("-" * 60)

However, when I run it, a blank browser page opens and this is printed to the console:

 raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.WebDriverException: Message: unknown error: Chrome failed to start: crashed
  (Driver info: chromedriver=2.38.551601 (edb21f07fc70e9027c746edd3201443e011a61ed),platform=Windows NT 6.3.9600 x86_64)
    I believe the link you have (with the `/rss`) is to an XML file, and so no javascript is being used in there – WillMonge Apr 20 '18 at 17:41
  • So, how can I make both `news.link.text` and `news.pubDate.text` appear in my output? When I print them using just Beautiful Soup, `news.title.text` prints normally, the link prints a new line, and pubDate raises an exception because it returns None and I called `.text` on it. – maufcost Apr 20 '18 at 17:43

2 Answers


I just tried it, and the following code works for me.

EDIT: I just updated the snippet; you can use `ElementTree.iter('tag')` to iterate over all the nodes with that tag:

import urllib.request
import xml.etree.ElementTree

news_url = "https://news.google.com/news/rss"
with urllib.request.urlopen(news_url) as page:
    xml_page = page.read()

# Parse XML page
e = xml.etree.ElementTree.fromstring(xml_page)

# Get the item list
for it in e.iter('item'):
    print(it.find('title').text)
    print(it.find('link').text)
    print(it.find('pubDate').text, '\n')
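If you want to check the `iter()`/`find()` field access without hitting the network, the same pattern can be exercised on an inline snippet (the stories and URLs below are made up for illustration):

```python
import xml.etree.ElementTree as ET

# Made-up inline RSS snippet standing in for the live feed
sample = """<rss version="2.0"><channel>
  <item>
    <title>First story</title>
    <link>https://example.com/1</link>
    <pubDate>Fri, 20 Apr 2018 17:41:00 GMT</pubDate>
  </item>
  <item>
    <title>Second story</title>
    <link>https://example.com/2</link>
    <pubDate>Fri, 20 Apr 2018 18:00:00 GMT</pubDate>
  </item>
</channel></rss>"""

root = ET.fromstring(sample)

# Collect (title, link, pubDate) for every <item> node
items = [(it.find('title').text, it.find('link').text, it.find('pubDate').text)
         for it in root.iter('item')]

for title, link, pub_date in items:
    print(title)
    print(link)
    print(pub_date, '\n')
```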

EDIT 2: A discussion of my personal preferences among scraping libraries
Personally, for interactive/dynamic pages in which I have to do things (click here, fill in a form, obtain results, ...), I use Selenium, and I usually don't need bs4, since you can use Selenium directly to find and parse the specific nodes of the page you are looking for.

I use bs4 in conjunction with requests (instead of urllib.request) to parse more static webpages in projects where I don't want a whole webdriver installed.

There is nothing wrong with using urllib.request, but requests (see here for the docs) is one of the best Python packages out there (in my opinion) and a great example of how to create a simple yet powerful API.

  • That works for me now. I had not heard of `xml.etree.ElementTree` before. Is it a more reliable way of web scraping than Beautiful Soup alone or Beautiful Soup + Selenium? Thanks in advance. – maufcost Apr 20 '18 at 19:08
  • ElementTree (or cElementTree for Python 2) is usually a little better at parsing XML than almost any other (Python) option. See [here](https://stackoverflow.com/a/19302655/4225467) for a brief comparison of XML parsing in Python – WillMonge Apr 20 '18 at 19:21
  • @MauriceFigueiredo I added a second edit with a small discussion of my personal preferences on scraping libraries – WillMonge Apr 20 '18 at 19:39
  • That's perfect, WillMonge. Thanks. – maufcost Apr 21 '18 at 14:51

Simply use BeautifulSoup with requests.

from bs4 import BeautifulSoup
import requests

r = requests.get('https://news.google.com/news/rss')
soup = BeautifulSoup(r.text, 'xml')
news_list = soup.find_all('item')

# do whatever you need with news_list
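The key difference from the question's code is the `'xml'` parser (which requires lxml to be installed): `html.parser` treats `<link>` as a void HTML tag, so `news.link.text` comes back empty, and it lowercases tag names, so `news.pubDate` is None. A self-contained sketch with a made-up inline snippet in place of the live feed:

```python
from bs4 import BeautifulSoup  # the 'xml' parser also needs lxml installed

# Made-up inline RSS snippet standing in for the live feed
sample = """<rss version="2.0"><channel>
  <item>
    <title>First story</title>
    <link>https://example.com/1</link>
    <pubDate>Fri, 20 Apr 2018 17:41:00 GMT</pubDate>
  </item>
</channel></rss>"""

soup = BeautifulSoup(sample, 'xml')
news_list = soup.find_all('item')

for news in news_list:
    print(news.title.text)    # the title text
    print(news.link.text)     # the full URL, not an empty string
    print(news.pubDate.text)  # tag case is preserved by the 'xml' parser
```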