How to get link URLs from an HTML page using Beautiful Soup

Question

I have an HTML page with multiple table cells like:

<td class="b-list__main">

<a data-gtm="Page B list" href="C.php?bsn=31888&amp;snA=773&amp;tnum=2" class="b-list__main__title">【info】10/23 develop note-new character</a><span class="b-list__main__icon"><i title="有圖片" class="material-icons icon-photo"></i></span>
</td>

I am new to Python and BeautifulSoup, and I am trying to get all the URLs from this class. I have tried:

for lastpage in root.find_all("td", class_="b-list__main"):
        print(lastpage.p)

output:

<p class="b-list__main__title" data-gtm="Page B list" href="C.php?bsn=31888&amp;snA=773&amp;tnum=2">【info】10/23 develop note-new character</p>
<p class="b-list__main__title" data-gtm="Page B list" href="C.php?bsn=31888&amp;snA=774&amp;tnum=1">【Q】alient team choice</p>
<p class="b-list__main__title" data-gtm="Page B list" href="C.php?bsn=31888&amp;snA=772&amp;tnum=1">【Q】lock account question</p>

Ultimately I want the largest snA number, for example 774. But I am taking it one step at a time: first the URLs, then the number.

C.php?bsn=31888&amp;snA=773&amp;tnum=2
C.php?bsn=31888&amp;snA=774&amp;tnum=1
C.php?bsn=31888&amp;snA=772&amp;tnum=1

I also tried:

     for lastpage in root.find_all("td", class_="b-list__main"):
        link = lastpage.fine('p',href=True)
        if link is None:
            continue
        print(lastpage.p['href'])

but I get `TypeError: 'NoneType' object is not subscriptable`.

Any help is appreciated, thanks.

My code:

import bs4
import re
from urllib import request as req
def getData(url):
    request = req.Request(url, headers={
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.121 Safari/537.36 "
    })
    with req.urlopen(request) as response:
        data = response.read().decode("utf-8")
    root = bs4.BeautifulSoup(data, "html.parser")
    for lastpage in root.find_all("td", class_="b-list__main"):
        
        print(lastpage.p)
url = "https://forum.gamer.com.tw/B.php?bsn=31888"
getData(url)

1.) `lastpage.fine(...)` should be `lastpage.find(...)` 2.) It seems that some `<td>` don't contain `<p>` tags. — Andrej Kesely, Oct 27 '20 at 23:10
It does. If I `print(lastpage.a)`, it returns a lot of text I don't really need: `【Q】alien team choice text message ......... @@` — Wing, Oct 27 '20 at 23:15
Are you sure it is `<p>`? I've never seen that HTML syntax. `<p>` is a paragraph, not a link; it doesn't have an href attribute. — Andrej Kesely, Oct 27 '20 at 23:19
Does this answer your question? [retrieve links from web page using python and BeautifulSoup](https://stackoverflow.com/questions/1080411/retrieve-links-from-web-page-using-python-and-beautifulsoup) — Christopher Peisert, Oct 27 '20 at 23:43
Yup, it does return the p class. If I `print(lastpage)`, it returns: `【info】10/23 new character` — Wing, Oct 27 '20 at 23:55

Sushil · Accepted Answer · 2020-10-29T01:07:49.717

I have never seen a p tag with an href attribute, but if that is what the HTML looks like, you can try something like this:

from bs4 import BeautifulSoup
import re

html = """
<p class="b-list__main__title" data-gtm="Page B list" href="C.php?bsn=31888&amp;snA=773&amp;tnum=2">【info】10/23 develop note-new character</p>
<p class="b-list__main__title" data-gtm="Page B list" href="C.php?bsn=31888&amp;snA=774&amp;tnum=1">【Q】alient team choice</p>
<p class="b-list__main__title" data-gtm="Page B list" href="C.php?bsn=31888&amp;snA=772&amp;tnum=1">【Q】lock account question</p>
"""

root = BeautifulSoup(html,'html5lib')

links_lst = []

for lastpage in root.find_all("p"):
    links_lst.append(lastpage['href'])

Output:

>>> links_lst
['C.php?bsn=31888&snA=773&tnum=2', 'C.php?bsn=31888&snA=774&tnum=1', 'C.php?bsn=31888&snA=772&tnum=1']
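If the page really contains `<a>` tags inside the `<td class="b-list__main">` cells, as in the first snippet of the question, a variant along these lines may be closer to what is needed. This is only a sketch: the sample markup is shortened from the question, the empty second cell is made up to show the `None` check, and `extract_links` is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# Shortened sample based on the question; the middle cell has no link on purpose.
html = """
<table><tr>
<td class="b-list__main"><a href="C.php?bsn=31888&amp;snA=773&amp;tnum=2" class="b-list__main__title">note</a></td>
<td class="b-list__main"><span>no link here</span></td>
<td class="b-list__main"><a href="C.php?bsn=31888&amp;snA=774&amp;tnum=1" class="b-list__main__title">question</a></td>
</tr></table>
"""


def extract_links(soup):
    """Collect the href of the first <a> in every matching cell."""
    links = []
    for cell in soup.find_all("td", class_="b-list__main"):
        # href=True only matches anchors that actually carry an href attribute.
        anchor = cell.find("a", href=True)
        if anchor is None:
            # Cells without a link would otherwise raise the TypeError from the question.
            continue
        links.append(anchor["href"])
    return links


soup = BeautifulSoup(html, "html.parser")
links = extract_links(soup)
print(links)
```

The key point is that the `href` is read from the same object that was checked for `None`, so a cell without a link is skipped instead of raising.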

In order to find the largest number, you can use a bit of regex. Just add these lines to the code provided above:

# Raw string; \d+ captures only the digits that follow "snA=".
pattern = re.compile(r'(?<=snA=)\d+')

num_lst = []

for link in links_lst:
    num_lst.append(int(pattern.findall(link)[0]))

print(f"Largest Number = {max(num_lst)} , Full link = {links_lst[num_lst.index(max(num_lst))]}")

Output:

Largest Number = 774 , Full link = C.php?bsn=31888&snA=774&tnum=1
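As an alternative to the regex, the query string can probably be parsed with the standard library instead. The sketch below assumes the links look like the three shown above; `sna_number` is a hypothetical helper name, not something from the original code.

```python
from urllib.parse import parse_qs, urlsplit

links_lst = [
    "C.php?bsn=31888&snA=773&tnum=2",
    "C.php?bsn=31888&snA=774&tnum=1",
    "C.php?bsn=31888&snA=772&tnum=1",
]


def sna_number(link):
    """Return the snA query parameter as an int, or None if it is missing."""
    values = parse_qs(urlsplit(link).query).get("snA")
    if not values or not values[0].isdigit():
        return None
    return int(values[0])


# Pair each link with its number and drop links that have no usable snA value.
numbered = [(sna_number(link), link) for link in links_lst]
numbered = [(number, link) for number, link in numbered if number is not None]

largest_number, largest_link = max(numbered, key=lambda pair: pair[0])
print(f"Largest Number = {largest_number} , Full link = {largest_link}")
```

This does not depend on how many digits the number has or on the order of the parameters in the URL.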

Edit:

Here is the full code:

import bs4
import re
from urllib import request as req

links_lst = []

def getData(url):
    request = req.Request(url, headers={
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.121 Safari/537.36 "
    })
    with req.urlopen(request) as response:
        data = response.read().decode("utf-8")
    root = bs4.BeautifulSoup(data, "html.parser")
    for lastpage in root.find_all("div", class_="b-list__tile"):
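        try:
            links_lst.append(lastpage.p['href'])
        except (TypeError, KeyError):
            # Skip tiles with no <p> tag (TypeError) or no href attribute (KeyError).
            pass
    # Raw string; \d+ captures only the digits that follow "snA=".
    pattern = re.compile(r'(?<=snA=)\d+')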
    
    num_lst = []

    for link in links_lst:
        num_lst.append(int(pattern.findall(link)[0]))

    print(f"Largest Number = {max(num_lst)} , Full link = {links_lst[num_lst.index(max(num_lst))]}")


url = "https://forum.gamer.com.tw/B.php?bsn=31888"
getData(url)

Output:

Largest Number = 774 , Full link = C.php?bsn=31888&snA=774&tnum=1
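The extracted link is relative to the page it came from. If a full URL is needed, for example to request that thread next, `urllib.parse.urljoin` can resolve it against the page URL. A small sketch, reusing the URL and the link from above:

```python
from urllib.parse import urljoin

page_url = "https://forum.gamer.com.tw/B.php?bsn=31888"
relative_link = "C.php?bsn=31888&snA=774&tnum=1"

# urljoin replaces the last path segment (B.php and its query) with the relative link.
absolute_link = urljoin(page_url, relative_link)
print(absolute_link)
```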

Thanks for the input, but somehow it does not work on my side. Let me paste the whole code and the actual URL of the page I am trying to scrape; that might make it easier to debug. — Wing, Oct 28 '20 at 18:13
You're welcome! If my answer helped, please accept it by clicking the green tick mark next to it. Thank you! — Sushil, Oct 29 '20 at 17:42
If you don't mind, can you explain how you found `class_="b-list__tile"`? Checking the source code, I can't find it anywhere. — Wing, Oct 29 '20 at 18:12
Go to the inspect element tab. Then press `ctrl + f` and type `b-list__tile` into the search box. You will see where it is. — Sushil, Oct 30 '20 at 04:40
I know why now: I was using the list view instead of the tile view, which is why it didn't show up. Thanks again. — Wing, Oct 30 '20 at 17:44
You're welcome! If my answer helped, please accept it by clicking the green tick mark next to it. Thank you! — Sushil, Oct 31 '20 at 06:12
