Answered here: How to join absolute and relative URLs?

I want to check internal links with BeautifulSoup and Selenium.

The script works when a link contains the full URL:

<a href="http...." />

The script does NOT work when a link contains only a partial (relative) path:

<a href="/internal_link.php" />

My Python script:

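# r (page HTML), exc (substrings to exclude) and site (current site URL)
# are defined elsewhere in the script.
import re
from bs4 import BeautifulSoup

soup = BeautifulSoup(r, 'html5lib')
links = []
for link in soup.find_all('a'):
    href = str(link.get('href')).lower()
    # Skip any link whose href contains an excluded word
    if any(word in href for word in exc):
        continue
    # Keep the first run of non-whitespace characters
    match = re.search(r'(\S+)', href)
    if match:
        st = match.group(0)
        if site in st:  # 2 SCENARIOS HERE
            links.append(st)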

Case 1: check all links (full path):

if "http" in st:

Case 2: check only internal links, where site is the current page (full path):

if site in st:

So I'm looking for a way to collect links even when the href does not contain the full URL.

HedgeHog
user3492770
  • You join the relative path with the current URL. See https://stackoverflow.com/a/8223955/5386938 –  Dec 31 '20 at 09:34
  • Does this answer your question? [How to join absolute and relative urls?](https://stackoverflow.com/questions/8223939/how-to-join-absolute-and-relative-urls) – baduker Dec 31 '20 at 09:35

1 Answer

Possible Example

from bs4 import BeautifulSoup

html = '''
<a href="/internal_link.php" />
<a href="http://www.example.com/internal_link.php" />
<a href="/internal_link.php" />

'''

exc = ['http']
url = 'http://www.example.com'

soup = BeautifulSoup(html, 'html5lib')
links = []
# href=True skips <a> tags that have no href attribute
for link in soup.find_all('a', href=True):
    href = link['href']
    if not any(word in href.lower() for word in exc):
        # Relative path: prepend the site URL
        links.append(''.join([url, href]))
    elif url in href.lower():
        # Already a full URL on the same site: keep as-is
        links.append(href)
links

Output

['http://www.example.com/internal_link.php',
 'http://www.example.com/internal_link.php',
 'http://www.example.com/internal_link.php']
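
As the comments on the question suggest, plain string concatenation can be replaced with urllib.parse.urljoin, which also copes with hrefs such as "page2.php" or "../other.php". Below is a minimal sketch of that idea, not a drop-in replacement for the script in the question: resolve_links, base_url and internal_only are hypothetical names, and "internal" is assumed to mean "same host as the current page".

```python
from urllib.parse import urljoin, urlparse


def resolve_links(hrefs, base_url, internal_only=True):
    """Turn raw href values into absolute URLs (hypothetical helper).

    hrefs could come from BeautifulSoup, for example:
        [a['href'] for a in soup.find_all('a', href=True)]
    """
    base_host = urlparse(base_url).netloc
    links = []
    for href in hrefs:
        # urljoin leaves absolute URLs alone and resolves relative ones
        absolute = urljoin(base_url, href.strip())
        # Assumption: a link is "internal" when its host matches the base host
        if internal_only and urlparse(absolute).netloc != base_host:
            continue
        links.append(absolute)
    return links


base_url = 'http://www.example.com/index.php'  # assumed current page
hrefs = [
    '/internal_link.php',
    'http://www.example.com/internal_link.php',
    'page2.php',
    'https://other.example.org/page',
]

internal = resolve_links(hrefs, base_url)
everything = resolve_links(hrefs, base_url, internal_only=False)
```

With internal_only=True, hrefs without a host of their own after joining (for example mailto: links) are dropped too, since their netloc is empty.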
HedgeHog