You have to join the base URL to your extracted href and then simply request that page as well:
for title in soup.find_all('a', href=True):
    if re.search(r"\d+$", title['href']):
        page = requests.get('https://www.bbc.com' + title['href'])
        article = BeautifulSoup(page.content, 'html.parser')  # do not reuse the name soup here
        print(article.h1.text)
Note

Your regex is not matching quite as intended, so take care with it.
Try to scrape gently and use the time module, for example, to add some delay between requests.
Some of the extracted URLs are duplicated, so deduplicate them before requesting.
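For the joining and deduplication, the standard library's urllib.parse.urljoin handles relative paths, absolute paths, and full URLs alike, and a set drops the duplicates. A minimal sketch (the example hrefs are illustrative):

```python
from urllib.parse import urljoin

base = 'https://www.bbc.com'

# example hrefs as they might come out of the anchor tags,
# including a duplicate and an already absolute URL
hrefs = [
    '/news/world-59463500',
    '/news/world-59463500',
    'https://www.bbc.com/news/uk-59466289',
]

# urljoin resolves each href against the base; the set removes duplicates
urls = {urljoin(base, h) for h in hrefs}

print(sorted(urls))
```

This avoids hand-concatenating strings, which breaks as soon as an href is already absolute.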
Example (with some adjustments)

Prints the first 150 characters of each article:
import requests, time
from bs4 import BeautifulSoup

baseurl = 'https://www.bbc.com'

def get_soup(url):
    page = requests.get(url)
    soup = BeautifulSoup(page.text, 'html.parser')
    return soup

def get_urls(url):
    urls = []
    # only anchors wrapping a headline, whose href contains the section name (e.g. news)
    for link in get_soup(url).select('a:has(h3)'):
        if url.split('/')[-1] in link['href']:
            urls.append(baseurl + link['href'])
    urls = list(set(urls))  # drop duplicated urls
    return urls

def get_news(url):
    for article_url in get_urls(url):
        item = get_soup(article_url)
        print(item.article.text[:150] + '...')
        time.sleep(2)  # scrape gently: delay between requests

get_news('https://www.bbc.com/news')
Output
New Omicron variant: Does southern Africa have enough vaccines?By Rachel Schraer & Jake HortonBBC Reality CheckPublished1 day agoSharecloseShare pageC...
Ghislaine Maxwell: Epstein pilot testifies he flew Prince AndrewPublished9 minutes agoSharecloseShare pageCopy linkAbout sharingRelated TopicsJeffrey ...
New mothers who died of herpes could have been infected by one surgeonBy James Melley & Michael BuchananBBC NewsPublished22 NovemberSharecloseShare pa...
Parag Agrawal: India celebrates new Twitter CEOPublished9 hours agoSharecloseShare pageCopy linkAbout sharingImage source, TwitterImage caption, Parag...