
I'm trying to scrape HTML, but unfortunately there is very little in the way of classes and ids, and the classes that are used are not consistent from page to page. There are several links on the page.

I have some specific links that I need to grab:

<a href="http://ExampleText.com/xyz">

and a similar link whose href starts with mailto:

The email addresses and URLs themselves will change, but the links will always start with <a href="http://ExampleText.com or <a href="mailto:

Right now I'm able to grab all the links with this code, but I don't know how to get only the links whose href contains that specific text.

label_links = label_soup.select("div.row a")
print(label_links)

I'm still really new to BeautifulSoup, and I haven't found this in the documentation (yet). Any help appreciated!

itsvinayak
Joel
  • This is the same as this question: [https://stackoverflow.com/questions/43814754/python-beautifulsoup-how-to-get-href-attribute-of-a-element/43815538](https://stackoverflow.com/questions/43814754/python-beautifulsoup-how-to-get-href-attribute-of-a-element/43815538) – itsvinayak Jun 15 '19 at 03:59
  • Did you read my question? I am not trying to get the href attribute. I am trying to use text that is a portion of the attribute in the filtering. If the links you posted are the same, I'm sorry but I don't understand well enough to see the similarity. – Joel Jun 15 '19 at 04:06

2 Answers


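re.compile() returns a regular expression object, which means regex in the code below is a regex object.

The regex object has its own search and match methods, each with optional pos and endpos parameters: regex.search(string[, pos[, endpos]]). search() returns a match object, or None if the pattern does not match.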

from bs4 import BeautifulSoup
import re

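# Sample markup with properly closed tags
html = '''
    <div>
    <a href="http://ExampleText.com/xyz"></a>
    <a href="mailto:example@email.com"></a>
    </div>
'''

soup = BeautifulSoup(html, "html.parser")
# href=True keeps only the <a> tags that actually have an href attribute
links = soup.find_all("a", href=True)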

def is_valid_url(url):
    regex = re.compile(
        r'^https?://'  # http:// or https://
        r'(?:(?:[A-Z0-9](?:[A-Z0-9-]{0,61}[A-Z0-9])?\.)+[A-Z]{2,6}\.?|'  # domain...
        r'localhost|'  # localhost...
        r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})' # ...or ip
        r'(?::\d+)?'  # optional port
        r'(?:/?|[/?]\S+)$', re.IGNORECASE)

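    # search() returns a match object or None; convert that to a plain bool
    return url is not None and regex.search(url) is not None

for link in links:
    href = link['href']

    if is_valid_url(href):
        print("Website link -> ", href)
    elif href.startswith("mailto:"):
        # Split only on the first colon so the address is kept whole
        print("Email address -> ", href.split(":", 1)[1])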

Output:

Website link ->  http://ExampleText.com/xyz
Email address ->  example@email.com
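If a full URL-validation regex feels heavier than needed, one alternative is to classify each href by its scheme using the standard library. This is only a sketch of that idea, not part of the answer above: classify_href is a hypothetical helper name, and it assumes the hrefs are plain strings such as those taken from link['href'].

```python
from urllib.parse import urlsplit


def classify_href(href):
    """Return ("email", address), ("website", url), or ("other", href)."""
    parts = urlsplit(href)
    if parts.scheme == "mailto":
        # For mailto: URLs the address lands in .path; any ?subject=... query is dropped.
        return ("email", parts.path)
    if parts.scheme in ("http", "https") and parts.netloc:
        return ("website", href)
    # Anything else (relative links, fragments, other schemes) is left untouched.
    return ("other", href)


for href in ["http://ExampleText.com/xyz", "mailto:example@email.com"]:
    print(classify_href(href))
```

Because urlsplit lowercases the scheme, this should also handle hrefs written as MAILTO: or HTTP://.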
bharatk

You can use an attribute = value CSS selector with the starts-with operator, ^=:

links = [item['href'] for item in soup.select('[href^="http://ExampleText.com/"]')]
links2 = [item['href'] for item in soup.select('[href^="mailto:"]')]

[attr^=value]

Represents elements with an attribute name of attr whose value is prefixed (preceded) by value.
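For completeness, here is a self-contained sketch of the prefix selector in use. The markup is hypothetical sample data modeled on the question (the div.row wrapper comes from the asker's own selector, and the third link is invented to show that non-matching hrefs are skipped).

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup; only the first two hrefs follow the question's patterns.
html = """
<div class="row">
  <a href="http://ExampleText.com/xyz">Site</a>
  <a href="mailto:example@email.com">Email</a>
  <a href="http://other.example.org/">Other</a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# [href^="..."] matches elements whose href value starts with the given text.
site_links = [a["href"] for a in soup.select('div.row a[href^="http://ExampleText.com"]')]
mail_links = [a["href"] for a in soup.select('div.row a[href^="mailto:"]')]

print(site_links)
print(mail_links)
```

The comparison is on the attribute value exactly as written in the page, so the prefix should match the capitalization used in the HTML.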

QHarr