I am doing a Python exercise that requires me to get the top news from the Google News website by web scraping and print it to the console. My first attempt used only the Beautiful Soup library to retrieve the news. This was my code:
from bs4 import BeautifulSoup
import urllib.request

news_url = "https://news.google.com/news/rss"

# Download the RSS feed
URLObject = urllib.request.urlopen(news_url)
xml_page = URLObject.read()
URLObject.close()

# Parse it and print title, link and pubDate of every <item>
soup_page = BeautifulSoup(xml_page, "html.parser")
news_list = soup_page.find_all("item")
for news in news_list:
    print(news.title.text)
    print(news.link.text)
    print(news.pubDate.text)
    print("-" * 60)
But it kept giving me errors: the 'link' and 'pubDate' of each item would not print. After some research, I found answers here on Stack Overflow saying that, since the website uses JavaScript, I should use the Selenium package in addition to Beautiful Soup. Despite not really understanding how Selenium works, I updated the code as follows:
from bs4 import BeautifulSoup
from selenium import webdriver

# Open the page in Chrome and grab the rendered source
driver = webdriver.Chrome("C:/Users/mauricio/Downloads/chromedriver")
driver.maximize_window()
driver.get("https://news.google.com/news/rss")
content = driver.page_source.encode("utf-8").strip()

# Parse the rendered page the same way as before
soup = BeautifulSoup(content, "html.parser")
news_list = soup.find_all("item")
print(news_list)
for news in news_list:
    print(news.title.text)
    print(news.link.text)
    print(news.pubDate.text)
    print("-" * 60)
However, when I run it, a blank browser page opens and this is printed to the console:
raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.WebDriverException: Message: unknown error: Chrome failed to start: crashed
(Driver info: chromedriver=2.38.551601 (edb21f07fc70e9027c746edd3201443e011a61ed),platform=Windows NT 6.3.9600 x86_64)
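I am not sure whether the problem is in the way I am parsing the feed or in the way I am starting Chrome through Selenium. What do I need to change so that the title, link and pubDate of each item get printed?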