How to open a URL and get its content using a web crawler

Question

I'm trying to use a web crawler to get news content from the sport, homepage, world, business, and technology sections. I have this code, which grabs the headers of the pages and the link URLs. How can I take each URL, open that page, and get the content in its body?

# Python code
import re  # needed for re.search below

import requests
from bs4 import BeautifulSoup

url = "https://www.aaa.com"
page = requests.get(url)

soup = BeautifulSoup(page.content, 'html.parser')
print(soup.prettify())
headlines = soup.find('body').find_all('h3')

# Print every link whose href ends in digits
for title in soup.find_all('a', href=True):
    if re.search(r"\d+$", title['href']):
        print(title['href'])

score 1 · Accepted Answer · edited Dec 01 '21 at 10:25

You have to join the base URL to each extracted href and then simply start over by requesting that page.

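for title in soup.find_all('a', href=True):
    if re.search(r"\d+$", title['href']):
        # Request the linked article and parse it separately from the listing page
        article_page = requests.get('https://www.bbc.com' + title['href'])
        article_soup = BeautifulSoup(article_page.content, 'html.parser')
        print(article_soup.h1.text)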

Note

Your regex does not match reliably, so take care.
Scrape gently: for example, use the time module to add a delay between requests.
Some of the URLs are duplicated.
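
The three points in the note can be sketched as one small helper. This is only a sketch under assumptions, not the code from the answer: `collect_article_urls` and `ARTICLE_PATTERN` are hypothetical names, and the pattern assumes article paths end in a hyphen followed by digits, which may not hold for every site. `urljoin` from the standard library is used in place of plain string concatenation so that hrefs which are already absolute are left alone.

```python
import re
from urllib.parse import urljoin

# Assumption: article paths end in "-<digits>", e.g. /news/world-12345678
ARTICLE_PATTERN = re.compile(r"-\d+$")


def collect_article_urls(hrefs, base_url):
    """Return absolute, de-duplicated article URLs in first-seen order."""
    seen = set()
    urls = []
    for href in hrefs:
        if not ARTICLE_PATTERN.search(href):
            continue  # not an article link under the assumed pattern
        absolute = urljoin(base_url, href)  # handles relative and absolute hrefs
        if absolute not in seen:
            seen.add(absolute)
            urls.append(absolute)
    return urls
```

When looping over the returned URLs, a `time.sleep(...)` call between requests keeps the crawl gentle.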

Example (with some adjustments)

This prints the first 150 characters of each article:

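import time

import requests
from bs4 import BeautifulSoup

baseurl = 'https://www.bbc.com'

def get_soup(url):
    """Download a page and return it as parsed HTML."""
    page = requests.get(url)
    soup = BeautifulSoup(page.text, 'html.parser')
    return soup

def get_urls(url):
    """Collect unique article URLs from links that wrap an <h3> headline."""
    urls = []
    for link in get_soup(url).select('a:has(h3)'):
        # Keep only links within the same section, e.g. 'news'
        if url.split('/')[-1] in link['href']:
            urls.append(baseurl + link['href'])
    urls = list(set(urls))  # drop duplicates
    return urls

def get_news(url):
    """Print the first 150 characters of each article, pausing between requests."""
    for article_url in get_urls(url):
        item = get_soup(article_url)
        if item.article is not None:  # skip pages without an <article> element
            print(item.article.text[:150] + '...')
        time.sleep(2)  # be gentle with the server

get_news('https://www.bbc.com/news')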

Output

New Omicron variant: Does southern Africa have enough vaccines?By Rachel Schraer & Jake HortonBBC Reality CheckPublished1 day agoSharecloseShare pageC...
Ghislaine Maxwell: Epstein pilot testifies he flew Prince AndrewPublished9 minutes agoSharecloseShare pageCopy linkAbout sharingRelated TopicsJeffrey ...
New mothers who died of herpes could have been infected by one surgeonBy James Melley & Michael BuchananBBC NewsPublished22 NovemberSharecloseShare pa...
Parag Agrawal: India celebrates new Twitter CEOPublished9 hours agoSharecloseShare pageCopy linkAbout sharingImage source, TwitterImage caption, Parag...

Thank you. I tried the example code, but I got `NotImplementedError: Only the following pseudo-classes are implemented: nth-of-type.` in the for loop of the get_urls function — geekPinkFlower, Dec 01 '21 at 18:38
Is your bs4 version up to date? `pip install beautifulsoup --upgrade` — HedgeHog, Dec 01 '21 at 18:56
When I run that, I get ERROR: Could not find a version that satisfies the requirement beautifulsoup (from versions: 3.2.0, 3.2.1, 3.2.2) ERROR: No matching distribution found for beautifulsoup — geekPinkFlower, Dec 01 '21 at 19:14
But I tried to reinstall it with `!python3 -m pip install beautifulsoup4` and got this output: Requirement already satisfied: beautifulsoup4 in /usr/local/lib/python3.7/dist-packages (4.6.3) — geekPinkFlower, Dec 01 '21 at 19:15
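
The reported version (4.6.3) likely explains the error. As far as I know, the `:has()` pseudo-class is only available from Beautiful Soup 4.7.0 on, when CSS selection moved to the soupsieve package. "Requirement already satisfied" only means the package is installed, not that it is current, so `pip install --upgrade beautifulsoup4` (note the package name) should help. If upgrading is not an option, here is a minimal sketch of a fallback that avoids `:has()`. `links_wrapping_h3` is a hypothetical helper name, not part of bs4.

```python
from bs4 import BeautifulSoup


def links_wrapping_h3(soup):
    """Return <a href=...> tags that contain an <h3>, without CSS :has()."""
    links = []
    for h3 in soup.find_all('h3'):
        # Walk up from the headline to the nearest enclosing link, if any
        link = h3.find_parent('a', href=True)
        if link is not None and not any(link is seen for seen in links):
            links.append(link)
    return links
```

In `get_urls`, the loop could then iterate over `links_wrapping_h3(get_soup(url))` instead of `get_soup(url).select('a:has(h3)')`.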
