I'm trying to use a web crawler to get news content from the homepage, sport, world, business, and technology sections. The code below grabs the headers of the page and the URLs of its links. How can I take each URL, open that page, and get the content of its body?

# Python code
import re

import requests
from bs4 import BeautifulSoup

url = "https://www.aaa.com"
page = requests.get(url)

soup = BeautifulSoup(page.content, 'html.parser')
print(soup.prettify())
headlines = soup.find('body').find_all('h3')

# Print every link whose href ends in digits
for title in soup.find_all('a', href=True):
    if re.search(r"\d+$", title['href']):
        print(title['href'])

1 Answer

You have to join the base URL to each extracted href and then repeat the same steps for that URL: request the page and parse it.

for title in soup.find_all('a', href=True):
    if re.search(r"\d+$", title['href']):
        # Build the absolute URL and request the linked page
        page = requests.get('https://www.bbc.com' + title['href'])
        article_soup = BeautifulSoup(page.content, 'html.parser')
        print(article_soup.h1.text)
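
Plain string concatenation works as long as every href is relative. As a minimal sketch of a more forgiving way to join, the standard library's urljoin handles both relative and already-absolute hrefs. The helper name absolute_url and the sample paths are made up for illustration; the base URL is the one used in the snippet above.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.bbc.com"  # base URL taken from the snippet above


def absolute_url(href, base=BASE_URL):
    """Return an absolute URL for href (hypothetical helper).

    Relative paths are resolved against base; hrefs that are
    already absolute are returned unchanged.
    """
    return urljoin(base, href)


# Hypothetical hrefs, only to show both cases
print(absolute_url("/news/example-12345"))
print(absolute_url("https://www.bbc.com/sport/example-67890"))
```

With this, the requests.get call would receive absolute_url(title['href']) instead of the concatenated string.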
Note
  • Your regex does not match reliably, so take care.

  • Scrape gently; for example, use the time module to add a delay between requests.

  • Some of the URLs are duplicated.
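
The last two notes can be sketched without touching the network. This is only an illustration: the function names unique_hrefs and polite_iter and the sample hrefs are invented here, and the pattern is the question's own regex, kept unchanged, so it inherits whatever false matches that regex has.

```python
import re
import time

ENDS_IN_DIGITS = re.compile(r"\d+$")  # the question's pattern, unchanged


def unique_hrefs(hrefs):
    """Keep hrefs that match the pattern, dropping duplicates but keeping order."""
    seen = set()
    result = []
    for href in hrefs:
        if ENDS_IN_DIGITS.search(href) and href not in seen:
            seen.add(href)
            result.append(href)
    return result


def polite_iter(items, delay=2.0, sleep=time.sleep):
    """Yield items one by one, pausing for delay seconds between them."""
    for index, item in enumerate(items):
        if index:  # no pause before the first item
            sleep(delay)
        yield item


# Hypothetical hrefs, only to show the filtering
sample = ["/news/a-1", "/news/a-1", "/sport", "/news/b-2"]
for href in polite_iter(unique_hrefs(sample), delay=0.1):
    print(href)
```

In a real crawl, the body of that last loop would be the request for each page, so the pause falls between consecutive requests.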

Example (with some adjustments)

This prints the first 150 characters of each article:

import time

import requests
from bs4 import BeautifulSoup

baseurl = 'https://www.bbc.com'

def get_soup(url):
    """Download a page and parse it."""
    page = requests.get(url)
    return BeautifulSoup(page.text, 'html.parser')

def get_urls(url):
    """Collect unique article URLs from links that wrap an <h3> headline."""
    section = url.split('/')[-1]  # last path segment, 'news' in the call below
    urls = set()
    for link in get_soup(url).select('a:has(h3)'):
        if section in link['href']:
            urls.add(baseurl + link['href'])
    return list(urls)

def get_news(url):
    for article_url in get_urls(url):
        item = get_soup(article_url)
        print(item.article.text[:150] + '...')
        time.sleep(2)  # pause between requests

get_news('https://www.bbc.com/news')

Output

New Omicron variant: Does southern Africa have enough vaccines?By Rachel Schraer & Jake HortonBBC Reality CheckPublished1 day agoSharecloseShare pageC...
Ghislaine Maxwell: Epstein pilot testifies he flew Prince AndrewPublished9 minutes agoSharecloseShare pageCopy linkAbout sharingRelated TopicsJeffrey ...
New mothers who died of herpes could have been infected by one surgeonBy James Melley & Michael BuchananBBC NewsPublished22 NovemberSharecloseShare pa...
Parag Agrawal: India celebrates new Twitter CEOPublished9 hours agoSharecloseShare pageCopy linkAbout sharingImage source, TwitterImage caption, Parag...
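
The output above runs the headline, byline, and share widget text together because .text concatenates every string inside the article element. If only the body text is wanted, one option is to collect the paragraph tags instead. The sketch below uses a made-up HTML sample and a hypothetical helper name, and it assumes the article body sits in p tags inside an article element, which may not hold for every page.

```python
from bs4 import BeautifulSoup

# Made-up markup standing in for a downloaded article page
SAMPLE_HTML = """
<html><body>
  <article>
    <h1>Sample headline</h1>
    <button>Share</button>
    <p>First paragraph.</p>
    <p>Second paragraph.</p>
  </article>
</body></html>
"""


def article_paragraphs(html):
    """Return the text of each <p> inside the first <article>, or [] if none."""
    soup = BeautifulSoup(html, "html.parser")
    article = soup.find("article")
    if article is None:  # page has no <article> element
        return []
    return [p.get_text(strip=True) for p in article.find_all("p")]


print("\n".join(article_paragraphs(SAMPLE_HTML)))
```

The None check also avoids the AttributeError that item.article.text raises on pages without an article element.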
Nimantha
HedgeHog
  • Thank you, I tried the example code but I got NotImplementedError: Only the following pseudo-classes are implemented: nth-of-type. in the get_urls function, in the for loop – geekPinkFlower Dec 01 '21 at 18:38
  • Is your bs4 version up to date? `pip install beautifulsoup --upgrade` – HedgeHog Dec 01 '21 at 18:56
  • when I run the code I got ERROR: Could not find a version that satisfies the requirement beautifulsoup (from versions: 3.2.0, 3.2.1, 3.2.2) ERROR: No matching distribution found for beautifulsoup – geekPinkFlower Dec 01 '21 at 19:14
  • but I tried to reinstall it !python3 -m pip install beautifulsoup4 and I got this output Requirement already satisfied: beautifulsoup4 in /usr/local/lib/python3.7/dist-packages (4.6.3) – geekPinkFlower Dec 01 '21 at 19:15
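
The error reported in the comments is consistent with the installed version, 4.6.3. As far as I know, select() only gained pseudo-classes such as :has() in Beautiful Soup 4.7.0, when it started relying on the soupsieve package. The package name on PyPI is beautifulsoup4, so the upgrade command would be `pip install --upgrade beautifulsoup4`. A minimal sketch for checking the installed version follows; the helper names and the 4.7.0 threshold are assumptions based on that recollection, not something stated in the thread.

```python
def version_tuple(version):
    """Turn a version string such as '4.6.3' into a tuple like (4, 6, 3)."""
    parts = []
    for piece in version.split(".")[:3]:
        digits = "".join(ch for ch in piece if ch.isdigit())
        parts.append(int(digits) if digits else 0)
    while len(parts) < 3:  # pad short versions such as '4.7'
        parts.append(0)
    return tuple(parts)


def supports_has_selector(version, minimum=(4, 7, 0)):
    """Assumed threshold: select() with :has() needs at least 4.7.0."""
    return version_tuple(version) >= minimum


try:
    import bs4
    print(bs4.__version__, supports_has_selector(bs4.__version__))
except ImportError:
    print("beautifulsoup4 is not installed")
```

On an older version that cannot be upgraded, a rough equivalent of `select('a:has(h3)')` is `[a for a in soup.find_all('a', href=True) if a.find('h3')]`.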