I am trying to get only the link (the href value) from the result of find_all().

Here is my code:

    mydivs = soup.find_all("td", {"class": "candidates"})
    for link in mydivs:
        print(link)

But it returns:

<td class="candidates"><div><a data-tn-element="view-unread-candidates" data-tn-link="true" href="/c#candidates?id=a722443b402&amp;ctx=jobs-tab-view-candidates">56 candidates</a><br/><a data-tn-element="view-unread-candidates" data-tn-link="true" href="/c#candidates?id=a7b2a139b402&amp;candidateFilter=4af15d8991a8"><span class="jobs-u-font--bold">(45 awaiting review)</span></a></div></td>

What I want to get:

/c#candidates?id=a722443b402&amp;ctx=jobs-tab-view-candidates

Solal

1 Answer

You can convert each bs4 element to a string and then use a regular expression to capture everything between href=" and the next quotation mark.

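    import re

    # Rest of imports/code up until your script.

    mydivs = soup.find_all("td", {"class": "candidates"})
    for link in mydivs:
        link_text = str(link)  # serialize the tag back to an HTML string
        # \s* tolerates optional spaces around "="; .+? stops at the next quote
        href_link = re.search(r'href\s*=\s*"(.+?)"', link_text)
        print(href_link.group(1))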
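
The loop above calls .group(1) directly on the result of re.search, which raises an AttributeError when a cell contains no href, because re.search returns None in that case. A minimal sketch of a guard, assuming a hypothetical helper named first_href (not part of bs4 or re) and a pattern that allows optional spaces around the = sign:

```python
import re

# Assumed pattern: optional whitespace around "=", double-quoted value.
HREF_PATTERN = re.compile(r'href\s*=\s*"(.+?)"')


def first_href(tag_html):
    """Return the first href value in tag_html, or None if there is none."""
    match = HREF_PATTERN.search(tag_html)
    return match.group(1) if match else None


# Illustrative inputs, not taken from the page in the question.
print(first_href('<td class="candidates"><a href="/c#candidates?id=1">x</a></td>'))
print(first_href('<td class="candidates">no link here</td>'))
```

Inside the loop you would then call first_href(str(link)) and skip the cell when the result is None.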

Small Example Shown Below:

    import re

    link_text = '<td class="candidates"><div><a data-tn-element="view-unread-candidates" data-tn-link="true" href="/c#candidates?id=a722443b402&amp;ctx=jobs-tab-view-candidates">56 candidates</a><br/><a data-tn-element="view-unread-candidates" data-tn-link="true" href="/c#candidates?id=a7b2a139b402&amp;candidateFilter=4af15d8991a8"><span class="jobs-u-font--bold">(45 awaiting review)</span></a></div></td>'
    # re.search returns only the first match, i.e. the first link in the cell
    href_link = re.search(r'href\s*=\s*"(.+?)"', link_text)
    print(href_link.group(1))

Output:

/c#candidates?id=a722443b402&amp;ctx=jobs-tab-view-candidates

You may need to adjust the spacing around href= in the pattern passed to re.search, since I cannot see exactly what your tag looks like. The pattern only has to match the exact text from href up to the first character of the link you want.
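
As an alternative to the regex approach, and not something the answer above relies on, Beautiful Soup can read the attribute directly from the tag. The sketch below assumes a trimmed copy of the markup from the question and the built-in "html.parser" backend; the variable names are illustrative. Note that the parser decodes entities, so the value comes back with a plain & rather than &amp;.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the markup from the question (data-* attributes omitted).
html = (
    '<td class="candidates"><div>'
    '<a href="/c#candidates?id=a722443b402&amp;ctx=jobs-tab-view-candidates">56 candidates</a><br/>'
    '<a href="/c#candidates?id=a7b2a139b402&amp;candidateFilter=4af15d8991a8">(45 awaiting review)</a>'
    '</div></td>'
)
soup = BeautifulSoup(html, "html.parser")

first_links = []
for cell in soup.find_all("td", {"class": "candidates"}):
    anchor = cell.find("a", href=True)  # first <a> with an href, or None
    if anchor is not None:
        first_links.append(anchor["href"])

print(first_links)
```

This avoids depending on how the tag is serialized, so spacing and quoting inside the HTML no longer matter.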

Edeki Okoh