
I want a list of all page URLs from a website. The following code does not return anything:

from bs4 import BeautifulSoup
import requests

base_url = 'http://www.techadvisorblog.com'
response = requests.get(base_url + '/a')
soup = BeautifulSoup(response.text, 'html.parser')

urls = []

for tr in soup.select('tbody tr'):
    urls.append(base_url + tr.td.a['href'])
  • Can you indicate part of the desired output? And why are you concatenating '/a' onto the end which re-directs to `https://techadvisorblog.com/about-us/` – QHarr Oct 27 '19 at 15:53

2 Answers


The response from the backend is 406 (Not Acceptable). You can get past that by specifying a User-Agent header:

>>> response = requests.get(base_url + '/a', headers={"User-Agent": "XY"})

Python Requests HTTP Response 406

You can then get the URLs:

>>> for link in soup.find_all('a'):
...     print(link.get('href'))
...
#content
https://techadvisorblog.com/
https://techadvisorblog.com
https://techadvisorblog.com/contact-us/
https://techadvisorblog.com/about-us/
https://techadvisorblog.com/disclaimer/
https://techadvisorblog.com/privacy-policy/
None
https://techadvisorblog.com/
https://techadvisorblog.com
https://techadvisorblog.com/contact-us/
https://techadvisorblog.com/about-us/
https://techadvisorblog.com/disclaimer/
https://techadvisorblog.com/privacy-policy/
None
https://techadvisorblog.com/
https://www.instagram.com/techadvisorblog
//www.pinterest.com/pin/create/button/?url=https://techadvisorblog.com/about-us/
https://techadvisorblog.com/contact-us/
https://techadvisorblog.com/
https://techadvisorblog.com/what-is-world-wide-web-www/
https://techadvisorblog.com/best-free-password-manager-for-windows-10/
https://techadvisorblog.com/solved-failed-to-start-emulator-the-emulator-was-not-properly-closed/
https://techadvisorblog.com/is-telegram-safe/
https://techadvisorblog.com/will-technology-ever-rule-the-world/
https://techadvisorblog.com/category/android/
https://techadvisorblog.com/category/knowledge/basic-computer/
https://techadvisorblog.com/category/games/
https://techadvisorblog.com/category/knowledge/
https://techadvisorblog.com/category/security/
http://Techadvisorblog.com/
http://Techadvisorblog.com
None
None
None
None
None
>>>
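Many of the printed hrefs are in-page anchors, `None` (tags without an href), or protocol-relative links, so they need normalizing before use. A minimal stdlib sketch of that filtering (the sample hrefs below mirror the output above):

```python
from urllib.parse import urljoin

base_url = 'https://techadvisorblog.com/'

# Sample hrefs like those printed above: an anchor, a missing href,
# a relative path, a protocol-relative link, and an absolute URL.
hrefs = ['#content', None, '/contact-us/',
         '//www.pinterest.com/pin/create/button/',
         'https://techadvisorblog.com/about-us/']

urls = []
for href in hrefs:
    if not href or href.startswith('#'):
        continue  # skip missing hrefs and in-page anchors
    # urljoin resolves relative and protocol-relative links
    urls.append(urljoin(base_url, href))

print(urls)
```

`urljoin` leaves absolute URLs untouched, prefixes relative paths with the base, and fills in the scheme for protocol-relative links.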
abhilb

I am not sure why you are concatenating /a onto the end of the URL, as this redirects to the about-us page. Also, I see no table/tr/td tags to work with on the base URL or on about-us. If instead you meant to cycle through the two (or more) pages of the base URL's pagination, you can do so by testing for the presence of a rel attribute with the value next. And yes, you need a valid User-Agent header.

import requests
from bs4 import BeautifulSoup as bs

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.120 Safari/537.36',
}

page = 1
with requests.Session() as s:
    s.headers = headers
    while True:
        r = s.get(f'https://techadvisorblog.com/page/{page}/')
        soup = bs(r.content, 'lxml')
        print(soup.select_one('title').text)
        # stop when the page has no rel="next" pagination link
        if soup.select_one('[rel=next]') is None:
            break
        page += 1
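The loop above stops when the `[rel=next]` selector finds nothing. For illustration, the same detection can be sketched with only the standard library (the markup below is a hypothetical WordPress-style pagination link, not fetched from the site):

```python
from html.parser import HTMLParser

class NextLinkFinder(HTMLParser):
    """Detects any element carrying rel="next" (WordPress pagination)."""
    def __init__(self):
        super().__init__()
        self.has_next = False

    def handle_starttag(self, tag, attrs):
        if dict(attrs).get('rel') == 'next':
            self.has_next = True

# Hypothetical markup resembling a WordPress pagination block.
page_html = '<a class="next page-numbers" rel="next" href="/page/2/">Next</a>'
finder = NextLinkFinder()
finder.feed(page_html)
print(finder.has_next)  # prints True while a further page exists
```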
QHarr