I want to scrape data from Yahoo News and Bing News pages. The data I want are the headlines and/or the text below the headlines (whichever can be scraped) and the dates (times) when they were posted.

I have written some code, but it does not return anything. The problem is with my URL, since I am getting a 404 response.

Can you please help me with it?

This is the code for Bing:

from bs4 import BeautifulSoup
import requests

term = 'usa'
url = 'http://www.bing.com/news/q?s={}'.format(term)

response = requests.get(url)
print(response)

soup = BeautifulSoup(response.text, 'html.parser')
print(soup)

And this is for Yahoo:

term = 'usa'

url = 'http://news.search.yahoo.com/q?s={}'.format(term)

response = requests.get(url)
print(response)

soup = BeautifulSoup(response.text, 'html.parser')
print(soup)

Please help me generate these URLs and explain the logic behind them. I'm still a noob :)

taga

1 Answer

Basically, your URLs are just wrong. The URLs you have to use are the same ones you see in the address bar when using a regular browser. Most search engines and aggregators use the q parameter for the search term. Most of the other parameters are usually not required (sometimes they are, e.g. for specifying the result page number).
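
As a minimal sketch of that logic, the query string can be built with the standard library so that search terms containing spaces or special characters are encoded correctly. The helper name `build_search_url` is hypothetical, and the two base URLs are the corrected ones used in this answer.

```python
from urllib.parse import urlencode

# Base search endpoints, copied from the corrected URLs in this answer.
BING_NEWS = 'https://www.bing.com/news/search'
YAHOO_NEWS = 'https://news.search.yahoo.com/search'


def build_search_url(base, term, **extra):
    """Return base + '?' + an encoded query string containing q=term.

    Hypothetical helper: any extra keyword arguments are appended as
    additional query parameters.
    """
    params = {'q': term}
    params.update(extra)
    return '{}?{}'.format(base, urlencode(params))


print(build_search_url(BING_NEWS, 'usa'))
print(build_search_url(YAHOO_NEWS, 'new york'))
```

The same result can be had by passing `params={'q': term}` to `requests.get`, which encodes the query string for you.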

Bing

from bs4 import BeautifulSoup
import requests
import re
term = 'usa'
url = 'https://www.bing.com/news/search?q={}'.format(term)
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
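# Each search result sits in a div with the class "news-card-body"
for news_card in soup.find_all('div', class_="news-card-body"):
    title = news_card.find('a', class_="title").text
    # The relative time is in a span whose aria-label ends with "ago"
    time = news_card.find(
        'span',
        attrs={'aria-label': re.compile(".*ago$")}
    ).text
    print("{} ({})".format(title, time))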

Output

Jason Mohammed blitzkrieg sinks USA (17h)
USA Swimming held not liable by California jury in sexual abuse case (1d)
United States 4-1 Canada: USA secure payback in Nations League (1d)
USA always plays the Dalai Lama card in dealing with China, says Chinese Professor (1d)
...

Yahoo

from bs4 import BeautifulSoup
import requests
term = 'usa'
url = 'https://news.search.yahoo.com/search?q={}'.format(term)
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
# Each search result sits in a div with the class "NewsArticle"
for news_item in soup.find_all('div', class_='NewsArticle'):
    title = news_item.find('h4').text
    time = news_item.find('span', class_='fc-2nd').text
    # Clean time text: drop the leading "·" separator and whitespace
    time = time.replace('·', '').strip()
    print("{} ({})".format(title, time))

Output

USA Baseball will return to Arizona for second Olympic qualifying chance (52 minutes ago)
Prized White Sox prospect Andrew Vaughn wraps up stint with USA Baseball (28 minutes ago)
Mexico defeats USA in extras for Olympic berth (13 hours ago)
...
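
The search results only give relative times such as `17h` or `52 minutes ago`. A rough sketch of turning them into approximate dates is shown here. The function name `relative_to_datetime` and the unit table are assumptions, covering only the formats seen in the outputs above plus their singular forms. Other spellings return `None`.

```python
import re
from datetime import datetime, timedelta

# Assumed unit spellings: short (Bing output) and long (Yahoo output).
_UNITS = {
    'h': 'hours', 'hour': 'hours', 'hours': 'hours',
    'd': 'days', 'day': 'days', 'days': 'days',
    'minute': 'minutes', 'minutes': 'minutes',
}


def relative_to_datetime(text, now=None):
    """Convert '17h', '1d' or '52 minutes ago' to an approximate datetime.

    Returns None when the text does not match a known pattern.
    """
    now = now or datetime.now()
    match = re.match(r'\s*(\d+)\s*([a-zA-Z]+)', text)
    if not match:
        return None
    amount = int(match.group(1))
    keyword = _UNITS.get(match.group(2).lower())
    if keyword is None:
        return None
    # Subtract the offset from the reference time
    return now - timedelta(**{keyword: amount})


# A fixed reference time is used so the example is reproducible.
ref = datetime(2019, 11, 17, 20, 0)
print(relative_to_datetime('13 hours ago', now=ref).strftime('%d-%m-%Y'))
print(relative_to_datetime('1d', now=ref).strftime('%d-%m-%Y'))
```

The result is only as precise as the relative text: `1d` could be anything from 24 to just under 48 hours ago.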
Bitto
  • Thanks, but is it possible to get the time as a date? Not n minutes/hours ago, but a date, for example 17-11-2019 – taga Nov 17 '19 at 20:33
  • @taga It is not directly given in the search results. Your best bet would be to calculate the date from `13 hours ago` or `2 days ago` etc. https://stackoverflow.com/questions/28268818/how-to-find-the-date-n-days-ago-in-python – Bitto Nov 18 '19 at 04:44
  • Hey, can you please tell me how the Yahoo and Bing links would look if I want to loop through pages? Not scraping only the first page, but the first 2, 3, 4, 5 pages – taga Nov 26 '19 at 08:31
  • @taga The link would be the same, but I guess you would have to use Selenium, as they are using Ajax on scroll to populate the results. – Bitto Nov 26 '19 at 16:58
  • Hey, I have checked the Yahoo link, but it always returns the data from the first page; it does not give me pages 2, 3, 4... The code works fine and gives no errors, but the result is wrong. Can you fix it? – taga Nov 28 '19 at 10:39
  • @taga Try this URL `https://news.search.yahoo.com/search?q={}&pz=10&bct=0&b={}&pz=10`, where `q` is the same search parameter and `b` changes like 1, 11, 21, 31 etc. for pages 1, 2, 3, 4... – Bitto Nov 28 '19 at 12:03
  • Does anyone know if there are rate limits for Yahoo and Bing? (I recently did a project searching Google but got blocked.) Thanks – tezzaaa Sep 14 '20 at 13:08
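
The Yahoo paging scheme described in the comments (`b` = 1, 11, 21, 31 for pages 1, 2, 3, 4) can be sketched as follows. The helper name `yahoo_page_url` is hypothetical, and the parameters `pz`, `bct` and `b` are taken as-is from the comment. Whether Yahoo still honours them is an assumption that has not been verified here.

```python
from urllib.parse import urlencode

YAHOO_NEWS = 'https://news.search.yahoo.com/search'
PAGE_SIZE = 10  # matches pz=10 in the URL from the comment


def yahoo_page_url(term, page):
    """Return the search URL for a 1-based page number (hypothetical helper)."""
    # Offsets run 1, 11, 21, 31, ... for pages 1, 2, 3, 4, ...
    offset = (page - 1) * PAGE_SIZE + 1
    params = {'q': term, 'pz': PAGE_SIZE, 'bct': 0, 'b': offset}
    return '{}?{}'.format(YAHOO_NEWS, urlencode(params))


# Print the URLs for the first three result pages
for page in range(1, 4):
    print(yahoo_page_url('usa', page))
```

Each URL can then be fetched and parsed with the same `NewsArticle` loop used for the first page.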