
I have code that collects all of the URLs from the "oddsportal" website for a page:

from bs4 import BeautifulSoup
import requests

headers = {
    'User-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36'}
source = requests.get("https://www.oddsportal.com/soccer/africa/africa-cup-of-nations/results/", headers=headers)

soup = BeautifulSoup(source.text, 'html.parser')
# grab every link in the per-season menu on the results page
main_div = soup.find("div", class_="main-menu2 main-menu-gray")
a_tag = main_div.find_all("a")
for i in a_tag:
    print(i['href'])

which returns these results:

/soccer/africa/africa-cup-of-nations/results/
/soccer/africa/africa-cup-of-nations-2019/results/
/soccer/africa/africa-cup-of-nations-2017/results/
/soccer/africa/africa-cup-of-nations-2015/results/
/soccer/africa/africa-cup-of-nations-2013/results/
/soccer/africa/africa-cup-of-nations-2012/results/
/soccer/africa/africa-cup-of-nations-2010/results/
/soccer/africa/africa-cup-of-nations-2008/results/

I would like the URLs to be returned as:

https://www.oddsportal.com/soccer/africa/africa-cup-of-nations/results/
https://www.oddsportal.com/soccer/africa/africa-cup-of-nations/results/#/page/2/
https://www.oddsportal.com/soccer/africa/africa-cup-of-nations/results/#/page/3/

for every parent URL generated for the results pages.

From Inspect Element, I can see how the page fragments get appended to the URL, under the div with `id="pagination"`:

[screenshot of Inspect Element showing the `id="pagination"` div]


1 Answer


The data under `id="pagination"` is loaded dynamically, so `requests` alone can't retrieve it.

However, you can get the table for each of those pages (1-3) by sending a GET request to:

https://fb.oddsportal.com/ajax-sport-country-tournament-archive/1/MN8PaiBs/X0/1/0/{page}/?_={timestamp}

where {page} corresponds to the page number (1-3) and {timestamp} is the current Unix timestamp.

You'll also need to add:

"Referer": "https://www.oddsportal.com/"

to your headers. Also, use the `lxml` parser instead of `html.parser` to avoid a `RecursionError`.

import re
import requests
from datetime import datetime
from bs4 import BeautifulSoup

headers = {
    "User-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36",
    "Referer": "https://www.oddsportal.com/",
}


with requests.Session() as session:
    session.headers.update(headers)
    for page in range(1, 4):
        # the trailing ?_= parameter acts as a cache-buster: pass the current timestamp
        response = session.get(
            f"https://fb.oddsportal.com/ajax-sport-country-tournament-archive/1/MN8PaiBs/X0/1/0/{page}/?_={datetime.now().timestamp()}"
        )

        # the response body is JSON-like: {"html":"..."}; extract the HTML payload
        table_data = re.search(r'{"html":"(.*)"}', response.text).group(1)
        soup = BeautifulSoup(table_data, "lxml")
        print(soup.prettify())
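
Instead of printing everything with `soup.prettify()`, you can search the parsed fragment for specific elements with `.find()`. As a minimal follow-up sketch, replacing the `print` line inside the loop above and assuming the returned fragment contains a results `<table>` (the tag name and row structure are assumptions; inspect the actual markup first):

        # hypothetical: walk the rows of the returned table, if one is present
        table = soup.find("table")
        if table is not None:
            for row in table.find_all("tr"):
                print([cell.get_text(strip=True) for cell in row.find_all("td")])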
MendelG
  • 1) It should be `Session()`, not `session()`; 2) you should use `session.get` instead of `requests.get`, otherwise you're not doing anything here! – αԋɱҽԃ αмєяιcαη Jul 04 '21 at 02:46
  • @αԋɱҽԃαмєяιcαη Ha, not sure how I did that! Thanks. I also always look forward to your answers. – MendelG Jul 04 '21 at 02:49
  • You're welcome. Note that you're dealing with a timed `AJAX` endpoint, where the final parameter is a timestamp; as long as you want a fresh result, you should query using the current timestamp, e.g. `?_=1625363690901` – αԋɱҽԃ αмєяιcαη Jul 04 '21 at 02:50
  • Leave it for the OP, as getting the current timestamp is a very basic thing. Just for you as a site contributor: you don't need to keep assigning the headers on each call as `headers=headers`; just before your `for` loop, you can assign them to the session one time with `session.headers.update(headers)`, and that will increase the performance 10x according to timeit. Also, you don't need to use `str(response.content)` to convert bytes to a string; just use `response.text` directly! – αԋɱҽԃ αмєяιcαη Jul 04 '21 at 03:06
  • @αԋɱҽԃαмєяιcαη These comments are very useful. Thank you. Is `Session` always better to use than vanilla `requests`? – MasayoMusic Jul 04 '21 at 03:10
  • @αԋɱҽԃαмєяιcαη Thank you. I have "refactored" my answer accordingly :) SO is possible because of people like you! – MendelG Jul 04 '21 at 03:11
  • @MasayoMusic Yes, [read-that](https://stackoverflow.com/a/66139389/7658985) – αԋɱҽԃ αмєяιcαη Jul 04 '21 at 03:13
  • @MendelG I don't quite understand the bit about the recursion error. I usually use `html.parser` as I am used to it. Why exactly does `html.parser` cause a recursion error? – MasayoMusic Jul 04 '21 at 03:14
  • @MasayoMusic I didn't understand why, but when I used `html.parser` I got that error – MendelG Jul 04 '21 at 03:14
  • @MasayoMusic `lxml` is a much quicker parser than `html.parser`; just use `lxml`, per the [docs comparison](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser). The recursion error is not a bug in the `html` parser; rather, some odd HTML structures confuse the parser, per [python-bug-reports](https://bugs.launchpad.net/beautifulsoup/+bug/1471755) – αԋɱҽԃ αмєяιcαη Jul 04 '21 at 03:17
  • @αԋɱҽԃαмєяιcαη I guess I will learn to use `lxml` instead. Thank you. – MasayoMusic Jul 04 '21 at 03:18
  • @MendelG Thank you for the response, however I am getting the entire HTML code for the page as the output. –  Jul 04 '21 at 04:04
  • @PyNoob_N Well, `soup.prettify()` will return the entire HTML. You can use `.find()` to search for specific elements. – MendelG Jul 04 '21 at 04:05
  • Oh, what should I use in this case? `id="pagination"`? –  Jul 04 '21 at 04:10
  • @PyNoob_N You'll have to inspect the HTML being returned. Also, see [python BeautifulSoup parsing table](https://stackoverflow.com/questions/23377533/python-beautifulsoup-parsing-table) – MendelG Jul 04 '21 at 04:11
  • So this is not the answer then? I am sorry, but that was not the question. I appreciate the attempt at a solution, however that's not what I was looking for. Thanks. –  Jul 04 '21 at 04:56
  • @PyNoob_N You _can't_ get the data from `id=pagination`, since it's not rendered on the page; the page is loaded dynamically, and `requests` doesn't support dynamic pages. However, I have provided an alternative for how to navigate to the next pages and extract the table (that's the available data) by sending a request to the `Ajax` page. See [Web-scraping JavaScript page with Python](https://stackoverflow.com/questions/8049520/web-scraping-javascript-page-with-python) – MendelG Jul 04 '21 at 05:05
  • Oh, I get that. So is there _no way_ I can get the child page URLs via code for this website? Assuming I am not doing anything the website admins don't want. –  Jul 04 '21 at 05:51
  • @PyNoob_N Yes, correct. It's because the page is loaded dynamically – MendelG Jul 04 '21 at 05:52
  • I cannot accept this answer, as it **is not** the answer to this question, in keeping with the principles of SO. However, I thank you for pointing me in the right direction. –  Jul 04 '21 at 07:30
  • I think you can easily generate the answer you are asking for from @MendelG's answer (+). In the first request's response, under the `html` key, there is both the number of pages (from the `x-page` attribute; I think take the last match, or parse as HTML and use an appropriate CSS selector) and the base URL. – QHarr Jul 04 '21 at 08:03
  • @QHarr BTW, I don't think the OP gets a notification about your comment, because you tagged me. I would get a notification either way, since it's my answer. – MendelG Jul 04 '21 at 08:08
  • Thanks for reminding me. I am just testing the theory of calculating it from what is matched by the `x-page` attribute at present, then will report back. I also need to see if the last page number is always visible, even with larger result sets. – QHarr Jul 04 '21 at 08:10
  • `pages = int(re.search(r'(\d+)', [i['href'] for i in soup.select('[x-page]')][-1]).group(1))` – QHarr Jul 04 '21 at 08:13
  • `if pages > 1: links = [f'https://www.oddsportal.com/soccer/africa/africa-cup-of-nations/results/#/page/{i}/' for i in range(2, pages + 1)]` – QHarr Jul 04 '21 at 08:16
  • I guess I would need to test whether that holds for all the years; can't see why not. Though, I'd need to dynamically grab the start part of the URL, which I thought I saw in the html. – QHarr Jul 04 '21 at 08:17
  • Annoyingly, it looks like you need to make an initial request to the landing page and grab the URLs for each of the years from there, e.g. `.main-menu-gray [href$="results/"]`, then use those as your base with later requests. – QHarr Jul 04 '21 at 08:25
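
Putting QHarr's fragments together with the AJAX request from the answer above gives a consolidated sketch for the tournament in the question. The `[x-page]` selector and the page-count expression come straight from the comments; whether the last `[x-page]` link always carries the highest page number is untested, as QHarr notes:

import re
import requests
from datetime import datetime
from bs4 import BeautifulSoup

BASE = "https://www.oddsportal.com/soccer/africa/africa-cup-of-nations/results/"
headers = {
    "User-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36",
    "Referer": "https://www.oddsportal.com/",
}

with requests.Session() as session:
    session.headers.update(headers)
    # fetch page 1 of the AJAX archive; its payload carries the x-page links
    response = session.get(
        f"https://fb.oddsportal.com/ajax-sport-country-tournament-archive/1/MN8PaiBs/X0/1/0/1/?_={datetime.now().timestamp()}"
    )
    table_data = re.search(r'{"html":"(.*)"}', response.text).group(1)
    soup = BeautifulSoup(table_data, "lxml")

    # QHarr's expression: the last [x-page] link holds the highest page number
    x_page_links = [a["href"] for a in soup.select("[x-page]")]
    pages = int(re.search(r"(\d+)", x_page_links[-1]).group(1)) if x_page_links else 1

    # page 1 is the bare results URL; pages 2..n append the #/page/{i}/ fragment
    links = [BASE] + [f"{BASE}#/page/{i}/" for i in range(2, pages + 1)]
    print(*links, sep="\n")

Extending this to every season would mean first collecting the per-year results URLs from the landing page (QHarr's `.main-menu-gray [href$="results/"]` selector) and then resolving each year's tournament id for the AJAX URL, since the `MN8PaiBs` id appears to be specific to this tournament; that lookup isn't covered here.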