Get Href by text using Beautifulsoup

Question

I'm using "requests" and "beautifulsoup" to find all the href links on a webpage whose link text matches a specific string. It already works, but if the text starts on a new line, beautifulsoup doesn't "see" it and doesn't return that link.

soup = BeautifulSoup(webpageAdress, "lxml")

path = soup.findAll('a', href=True, text="Something3")
print(path)

Example:

With this markup, it returns the href of the Something3 link:

...
<a href="page1/somethingC.aspx">Something3</a>
...

With this markup, it doesn't return the href of the Something3 link:

...
<a href="page1/somethingC.aspx">
Something3</a>
...

The difference is that the link text (Something3) is on a new line. I can't change the HTML because I'm not the webmaster of that page.

Any idea how I can solve that?

Note: I've already tried `soup.replace('\n', ' ').replace('\r', '')`, but I get the error `'NoneType' object is not callable`.

Thank you all for your answers. You helped me a lot! :) – Bgreat Apr 10 '19 at 14:16

score 1 · Accepted Answer · answered Apr 10 '19 at 10:26

You can use a regex to find any text that contains "Something3":

from bs4 import BeautifulSoup
import re

html = '''<a href="page1/somethingC.aspx">Something3</a>

<a href="page1/somethingC.aspx">
Something3</a>'''

soup = BeautifulSoup(html, "lxml")

# A compiled regex is searched for anywhere in the link text,
# so a leading newline no longer prevents the match
path = soup.findAll('a', href=True, text=re.compile("Something3"))

for link in path:
    print(link['href'])

chitown88

What does re.compile do? – Bgreat Apr 10 '19 at 12:23
Read more about it [here](https://stackoverflow.com/questions/452104/is-it-worth-using-pythons-re-compile). Basically, it lets you match a pattern rather than an exact keyword. With `text="Something3"`, the leading `\n` means nothing is returned, because the text is not an exact match. So instead we check whether that substring occurs anywhere in the whole string; a regex is just one way to do that. – chitown88 Apr 10 '19 at 12:31
Thank you for the explanation! – Bgreat Apr 10 '19 at 14:13
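
To make the difference described in the comment above concrete, here is a minimal sketch comparing an exact-string filter with a regex filter. It makes a few assumptions that are not in the thread: a bs4 version where `string=` is accepted as the newer name for `text=`, the built-in `html.parser` instead of `lxml` (to avoid the extra dependency), and a hypothetical second link (`Something30`, `page1/other.aspx`) added only to show that a bare substring pattern can over-match.

```python
import re
from bs4 import BeautifulSoup

# The second link is hypothetical; it is here only to show over-matching.
html = '''<a href="page1/somethingC.aspx">
Something3</a>
<a href="page1/other.aspx">Something30</a>'''

soup = BeautifulSoup(html, "html.parser")

# Exact comparison: the first link's text is "\nSomething3", so nothing matches.
exact = soup.find_all("a", href=True, string="Something3")

# Substring pattern: found anywhere in the text, so both links match.
loose = soup.find_all("a", href=True, string=re.compile("Something3"))

# Anchored pattern: tolerates surrounding whitespace but rejects "Something30".
strict = soup.find_all("a", href=True, string=re.compile(r"^\s*Something3\s*$"))

print(len(exact))
print([a["href"] for a in loose])
print([a["href"] for a in strict])
```

If only the exact text (ignoring surrounding whitespace) should match, the anchored pattern is probably the safer choice of the two.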

score 1 · Answer 2 · answered Apr 10 '19 at 11:59

You can use the `:contains` pseudo-class with bs4 4.7.1:

from bs4 import BeautifulSoup as bs

html = '<a href="page1/somethingC.aspx">Something3</a>'
soup = bs(html, 'lxml')
links = [link.text for link in soup.select('a:contains(Something3)')]
print(links)
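
As a hedged variation on the snippet above, the following sketch returns the href rather than the link text and uses the multi-line markup from the question. It assumes a newer Soup Sieve (2.1 or later), where `:-soup-contains()` is the preferred spelling and plain `:contains()` is deprecated; on older installs the `:contains()` form shown above would be needed instead. It also swaps in `html.parser` for `lxml`, purely to avoid the extra dependency.

```python
from bs4 import BeautifulSoup

# Multi-line link text, as in the question.
html = '''<a href="page1/somethingC.aspx">
Something3</a>'''

soup = BeautifulSoup(html, "html.parser")

# a[href] keeps only anchors that have an href attribute;
# :-soup-contains() does a substring check on the element's text.
hrefs = [a["href"] for a in soup.select('a[href]:-soup-contains("Something3")')]
print(hrefs)
```

Because this is a substring check, it would also match link text such as "Something30".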

QHarr

Neat trick! Now I have to learn about selecting with pseudo-classes... – Jack Fleeting Apr 10 '19 at 15:06
@JackFleeting Yeah. I am really pleased with the new bs4. It has so many more awesome features. – QHarr Apr 10 '19 at 15:14
Didn't know that. Thanks! ;) – Bgreat Apr 10 '19 at 16:15
No worries. I would expect it to be faster than regex but haven't tested. – QHarr Apr 10 '19 at 16:18

score 0 · Answer 3 · answered Apr 10 '19 at 11:56

And a solution without regex:

path = soup.select('a')
# strip() removes the leading newline before comparing
if path[0].getText().strip() == 'Something3':
    print(path)

Output:

[<a href="page1/somethingC.aspx">
Something3</a>]
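
The snippet above checks only the first link. A possible generalization, sketched here under a few assumptions, loops over every anchor and collects the hrefs whose stripped text equals the target. The first link (`Something1`, `page1/somethingA.aspx`) is hypothetical and is included only so there is something to filter out; `html.parser` is used instead of `lxml` to keep the example dependency-free.

```python
from bs4 import BeautifulSoup

# The first link is hypothetical; the second is the multi-line case from the question.
html = '''<a href="page1/somethingA.aspx">Something1</a>
<a href="page1/somethingC.aspx">
Something3</a>'''

soup = BeautifulSoup(html, "html.parser")

# get_text(strip=True) drops surrounding whitespace, including the newline,
# so the comparison is exact without needing a regex.
hrefs = [
    a["href"]
    for a in soup.find_all("a", href=True)
    if a.get_text(strip=True) == "Something3"
]
print(hrefs)
```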

Jack Fleeting

Thank you for your helpful answer. – Bgreat Apr 10 '19 at 14:14
