
I'm trying to read this link's content via BeautifulSoup and then fetch the article dates present in span.f

import requests
import json
from bs4 import BeautifulSoup
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.86 Safari/537.36'}
from selenium import webdriver
link="https://www.google.com/search?q=replican+party+announced&ie=utf-8&oe=utf-8&client=firefox-b"
browser=webdriver.Firefox()
browser.get(link)
s=requests.get(link)
soup5 =BeautifulSoup(s.content,'html.parser')

Now I want to fetch all the article dates present in <span class="f">Apr 27, 2018 - </span> along with their corresponding "link URL", but this code isn't fetching anything for me:

for i in soup5.find_all("div",{"class":"g"}):
    print (i.find_all("span",{"class":"f"}))
vinita

2 Answers


You don't need Selenium for this task. Send the request with a browser-like User-Agent header and use BeautifulSoup's .select() method as below:

import requests
from bs4 import BeautifulSoup
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.86 Safari/537.36'}

link = "https://www.google.com/search?q=replican+party+announced&ie=utf-8&oe=utf-8&client=firefox-b"

r = requests.get(link, headers=headers, timeout=4)

encoding = r.encoding if 'charset' in r.headers.get('content-type','').lower() else None

soup = BeautifulSoup(r.content, 'html.parser', from_encoding=encoding)

for d in soup.select("div.s > div"):
    # look up the date and the displayed URL once per result block
    date = d.select_one("span.st > span.f")
    cite = d.select_one("div.f > cite")
    # print only results that have both a date and a URL
    if date and cite:
        print(date.text)
        print(cite.text)

Output:

2018. 4. 27. - 
https://www.cnn.com/2017/11/10/politics/house.../index.html
2018. 3. 19. - 
thehill.com/.../379087-former-gop-lawmaker-announces-hes-leav...
2018. 4. 11. - 
https://www.nytimes.com/2018/04/11/us/.../paul-ryan-speaker.htm...
2017. 10. 24. - 
https://www.theguardian.com/.../jeff-flake-retire-republican-senat...
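
The dates above print as "2018. 4. 27. - ", while the question shows "Apr 27, 2018 - ", presumably because the result page is localized for whoever requests it. If real date objects are needed, a minimal sketch along these lines could normalize both forms. The helper name parse_result_date and the list of formats are assumptions for illustration, covering only the two formats seen in this thread:

```python
from datetime import datetime

# Formats seen in this thread's outputs (assumption: other locales may differ).
# Note: %b matches English month abbreviations only under an English/C locale.
KNOWN_FORMATS = ("%b %d, %Y", "%Y. %m. %d.")


def parse_result_date(text):
    """Turn text such as 'Apr 27, 2018 - ' into a date, or None if unknown."""
    # Drop surrounding whitespace and the trailing ' - ' separator.
    cleaned = text.strip().rstrip("-").strip()
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(cleaned, fmt).date()
        except ValueError:
            continue  # try the next known format
    return None


print(parse_result_date("Apr 27, 2018 - "))
print(parse_result_date("2018. 4. 27. - "))
```

Returning None for unrecognized text lets the caller decide whether to skip the result or keep the raw string.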
Zilong Li
  • Thanks, that was awesome! But I'd also like to print the corresponding "link URL" in front of each of these dates. Please suggest how to do that too. – vinita May 23 '18 at 08:27
  • @vinita Hi, I've updated my answer as per your request. – Zilong Li May 23 '18 at 08:39

Since you are already using Selenium, you don't need requests: pass the browser's page_source to BeautifulSoup, then call find_all() and print the dates as follows:

  • Code Block :

    from bs4 import BeautifulSoup as soup
    from selenium import webdriver
    headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.86 Safari/537.36'}
    link="https://www.google.com/search?q=replican+party+announced&ie=utf-8&oe=utf-8&client=firefox-b"
    browser = webdriver.Firefox(executable_path=r'C:\Utility\BrowserDrivers\geckodriver.exe')
    browser.get(link)
    soup5 = soup(browser.page_source,'html.parser')
    print("Dates are as follows : ")
    for i in soup5.find_all("span",{"class":"f"}):
        print (i.text)
    print("Link URLs are as follows : ")
    for i in soup5.find_all("cite",{"class":"iUh30"}):
        print (i.text)
    
  • Console Output :

    Dates are as follows : 
    Mar 19, 2018 - 
    Apr 27, 2018 - 
    Feb 1, 2018 - 
    Apr 17, 2018 - 
    Jan 9, 2018 - 
    Link URLs are as follows : 
    thehill.com/.../379087-former-gop-lawmaker-announces-hes-leaving-gop-tears-into-tr...
    https://edition.cnn.com/2017/11/10/politics/house-retirement-tracker/index.html
    https://en.wikipedia.org/wiki/Republican_Party_presidential_candidates,_2016
    https://www.cbsnews.com/.../joe-scarborough-announces-hes-leaving...
    

Update

In case you want to print the dates and link URLs side by side, you can use:

  • Code Block :

    from bs4 import BeautifulSoup as soup
    from selenium import webdriver
    headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.86 Safari/537.36'}
    link="https://www.google.com/search?q=replican+party+announced&ie=utf-8&oe=utf-8&client=firefox-b"
    browser = webdriver.Firefox(executable_path=r'C:\Utility\BrowserDrivers\geckodriver.exe')
    browser.get(link)
    soup5 = soup(browser.page_source,'html.parser')
    for i,j in zip(soup5.find_all("span",{"class":"f"}), soup5.find_all("cite",{"class":"iUh30"})):
        print(i.text, j.text)
    
  • Console Output :

    Mar 19, 2018 -  thehill.com/.../379087-former-gop-lawmaker-announces-hes-leaving-gop-tears-into-tr...
    Apr 27, 2018 -  https://edition.cnn.com/2017/11/10/politics/house-retirement-tracker/index.html
    Feb 1, 2018 -  https://en.wikipedia.org/wiki/Republican_Party_presidential_candidates,_2016
    Apr 17, 2018 -  https://www.cbsnews.com/.../joe-scarborough-announces-hes-leaving...
    Jan 9, 2018 -  www.travisgop.com/2018_precinct_conventions
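
One caveat with zipping the two find_all() lists: if some result has no date, every later date shifts onto the wrong URL. A minimal sketch of a per-result pairing follows. It assumes each result sits in a div with class "g" (as in the question) that contains both the cite.iUh30 and the span.f; the sample markup and the helper name pair_dates_with_links are hypothetical, and the real page structure may differ:

```python
from bs4 import BeautifulSoup

# Hypothetical markup that only mimics the class names used in this thread.
SAMPLE_HTML = """
<div class="g"><cite class="iUh30">example.com/a</cite>
  <span class="st"><span class="f">Mar 19, 2018 - </span>First snippet.</span></div>
<div class="g"><cite class="iUh30">example.com/b</cite>
  <span class="st">Snippet without a date.</span></div>
<div class="g"><cite class="iUh30">example.com/c</cite>
  <span class="st"><span class="f">Apr 27, 2018 - </span>Third snippet.</span></div>
"""


def pair_dates_with_links(html):
    """Return (date, url) tuples, reading both from the same result block."""
    soup = BeautifulSoup(html, "html.parser")
    pairs = []
    for result in soup.find_all("div", {"class": "g"}):
        date = result.find("span", {"class": "f"})
        cite = result.find("cite", {"class": "iUh30"})
        # Skip results that lack either piece instead of misaligning the rest.
        if date is not None and cite is not None:
            pairs.append((date.text.strip(), cite.text.strip()))
    return pairs


for date, url in pair_dates_with_links(SAMPLE_HTML):
    print(date, url)
```

With Selenium, the same function could be called on browser.page_source instead of the sample string.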
    
undetected Selenium
  • Thanks, but I'd also like to print the corresponding "link URL" in front of each of these dates. Any idea how to do it? – vinita May 23 '18 at 08:31
  • @vinita Check out my updated answer and let me know the result. – undetected Selenium May 23 '18 at 08:42
  • Thanks :-), just one last request. I was trying to print the link text by updating your code as follows (please tell me if it's fine or if there is a better alternative). I'm using split on "-" to remove the date: for i,j,k in zip(soup5.find_all("span",{"class":"f"}), soup5.find_all("cite",{"class":"iUh30"}),soup5.find_all("span",{"class":"st"}) ): print(i.text, j.text,k.text.split("-")[1:]) – vinita May 23 '18 at 09:08
  • @vinita I'm afraid :( I can't quite understand your requirement, as in `print the link text` and `split on "-" to remove the date`. Can you raise a new question for your new requirement, please? – undetected Selenium May 23 '18 at 09:21
  • The link text for the first Google search result is "Former GOP Rep. Charles Djou (Hawaii) announced he is leaving the Republican Party." I'm splitting it on "-" to remove the date "Mar 19, 2018". I don't want this date, as I've already printed it via your previous code. – vinita May 23 '18 at 09:24
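
For the date-removal step described in the last comment, splitting on every "-" also cuts the snippet at any hyphen inside the text itself. A minimal sketch of an alternative is to strip the date as a prefix, using the text of the span.f that was already extracted. The helper name strip_date_prefix is hypothetical, and the second sample snippet is made up to show a hyphenated word:

```python
def strip_date_prefix(snippet, date_prefix):
    """Remove a leading date such as 'Mar 19, 2018 - ' from a snippet."""
    if date_prefix and snippet.startswith(date_prefix):
        return snippet[len(date_prefix):].strip()
    # No matching prefix: return the snippet unchanged apart from whitespace.
    return snippet.strip()


date_text = "Mar 19, 2018 - "  # text of the span.f element
snippet = date_text + "Former GOP Rep. Charles Djou (Hawaii) announced he is leaving the Republican Party."
print(strip_date_prefix(snippet, date_text))

# Hypothetical snippet containing a hyphen; split("-") would break it apart.
hyphenated = date_text + "A long-time member announced a change."
print(strip_date_prefix(hyphenated, date_text))
```

In the loop from the comment, this would replace k.text.split("-")[1:] with strip_date_prefix(k.text, i.text), assuming i and k come from the same result.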