0

I am trying to extract all hotel names for a given country from the following site: https://www.holidaycheck.de/dh/hotels-tunesien/e10cef63-45d4-3511-92f1-43df5cbd9fe1. Given that the data is split across several pages, I am trying to set up a loop. Unfortunately, I don't manage to extract the total number of pages (the highest page number) from the HTML to tell my loop where to stop. (I know this question has been frequently asked and answered, and I read through all the posts, but none seems to solve my problem.)

The HTML code looks like this:

<div class="main-nav-items">
<span class="prev-next"
<span>
<i class="prev-arrow icon icon-left-arrow-line"></i>
<span>previous</span>
</span>
</a>
</span>
<span class="other-page">
<a class="link" href="/dh/hotels-tunesien/e10cef63-45d4-3511-92f1-43df5cbd9fe1">66</a>

What I need is the number inside the `<a>` tag in the last line of the snippet (66 in this case).
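For reference, a targeted way to read that value, assuming the live page keeps the structure shown above, is a CSS selector scoped to the "other-page" span; this minimal sketch parses only the snippet, while on the real page the soup would come from the downloaded HTML:

from bs4 import BeautifulSoup

snippet = '''<span class="other-page">
<a class="link" href="/dh/hotels-tunesien/e10cef63-45d4-3511-92f1-43df5cbd9fe1">66</a>
</span>'''

soup = BeautifulSoup(snippet, 'lxml')
last_link = soup.select('span.other-page a.link')[-1]  # last pagination link
print(int(last_link.text))  # 66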

I tried it with:

data = soup.find_all('a', {'class': 'link'})  # all pagination links
y = str(data)                                 # serialize the whole result set
x = re.findall("[0-9]+", y)                   # every digit run in the serialized tags
print(x)

But this code also gives me the numbers from the href, such as 45 and 3511.
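For illustration, the extra numbers come from str(data), which serializes the whole tags, href included; a tiny reproduction on the last link alone (the tag string is copied from the snippet above):

import re

tag = '<a class="link" href="/dh/hotels-tunesien/e10cef63-45d4-3511-92f1-43df5cbd9fe1">66</a>'
print(re.findall("[0-9]+", tag))   # picks up 10, 63, 45, 3511, ... from the href as well as 66
print(re.findall("[0-9]+", '66'))  # applied to the link text only: ['66']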

Additionally I tried:

data = soup.find_all('a', {'class': 'link'})
numbers = [d.text for d in data]  # visible text of each link
print(numbers)

This worked well, except that "next" and "previous" are also included, and I didn't manage to convert the output into integers, from which I could then extract the maximum and drop "previous" and "next".
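One way to do that filtering, sketched here on a literal list standing in for [d.text for d in data], is to keep only the entries that are pure digits before converting:

numbers = ['previous', 'next', '3', '11', '12', '66']  # stand-in for the scraped link texts

pages = [int(n) for n in numbers if n.strip().isdigit()]  # drops 'previous'/'next'
print(max(pages))  # 66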

Besides, I tried it with a `while` loop as explained here: scraping data from unknown number of pages using beautiful soup. But somehow this did not return all hotels and skipped pages...
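Once the highest page number is known, the loop itself could look roughly like the sketch below; note that the pagination parameter name and the hotel-name selector are assumptions and would need to be checked against the real URLs and markup:

import requests
from bs4 import BeautifulSoup

BASE = "https://www.holidaycheck.de/dh/hotels-tunesien/e10cef63-45d4-3511-92f1-43df5cbd9fe1"

def scrape_all_pages(max_page):
    """Visit pages 1..max_page and collect the hotel names from each page."""
    hotels = []
    for page in range(1, max_page + 1):
        # 'p' as the page parameter and 'a.hotel-name' as the selector are
        # placeholders; inspect the site to find the real ones
        resp = requests.get(BASE, params={"p": page})
        soup = BeautifulSoup(resp.text, 'lxml')
        hotels.extend(a.text.strip() for a in soup.select("a.hotel-name"))
    return hotels

# usage: hotels = scrape_all_pages(66)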

I would highly appreciate it if someone could give me some advice on how to fix my problem. Thank you!

asked by Nadine (edited by Cœur)

1 Answer

0
html = '''<div class="main-nav-items">
<span class="prev-next"
<span>
<i class="prev-arrow icon icon-left-arrow-line"></i>
<span>previous</span>
</span>
</a>
</span>
<span class="other-page">
<a class="link" href="/dh/hotels-tunesien/e10cef63-45d4-3511-92f1-43df5cbd9fe1">66</a>'''

from bs4 import BeautifulSoup as BS

soup = BS(html, 'lxml')
data = soup.find_all('a', {'class': 'link'})

res = []
for i in data:
    res.append(i.text)  # collect the visible text of each link

res_int = []
for i in res:
    try:
        res_int.append(int(i))  # keep only the values that are numbers
    except ValueError:          # non-numeric texts like 'previous' are skipped
        print("current value is not a number")

print(max(res_int))  # highest page number (66 for this snippet)
answered by Dmitriy Fialkovskiy
  • Thanks a lot for your fast response. The code works (so my first problem is solved); however, since I have several links like the one above, I now get a list with all the page numbers, but I still can't extract the maximum. Any idea on that? print(i.text) output: 3 68 Nächste 11 12 13 14 15 16 17 18 19 20 – Nadine Jul 18 '17 at 18:15
  • Returns an error message: ValueError: max() arg is an empty sequence @Dmitriy. I guess the numbers in the list are not recognized as integers – Nadine Jul 18 '17 at 19:07
  • Works perfectly, thank you so much. Do I understand correctly that the code now loops through the values and, if it can't properly convert a value into an integer, it prints "current value is not a number"? If yes, what is the difference between doing it with try and except and converting them all with int() (without try and except)? The results I see differ, but I don't understand why. Is there a rule of thumb for when to use which approach? – Nadine Jul 18 '17 at 19:25
  • `try - except` prevents you from getting errors when a string of non-numeric characters is passed to the `int()` function (illustrated below) – Dmitriy Fialkovskiy Jul 18 '17 at 19:32
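To illustrate the last comment: int() raises a ValueError as soon as it meets a non-numeric string such as "Nächste", while the per-item try/except simply skips it. A small stand-alone demonstration, with the values made up to mirror the output quoted in the comments:

values = ['3', '68', 'Nächste', '11']  # stand-in for the scraped link texts

# converting everything in one go fails on the first non-numeric string
try:
    pages = [int(v) for v in values]
except ValueError as exc:
    print("direct int() conversion fails:", exc)

# converting item by item inside try/except skips the bad value instead
pages = []
for v in values:
    try:
        pages.append(int(v))
    except ValueError:
        print("current value is not a number:", v)

print(max(pages))  # 68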