
I'm not sure I'm asking this question correctly, but I ran into something I've never seen before (FWIW), and since research didn't turn up anything exactly like this, I'm confused:

Trying to scrape certain links from this page, I go through the usual:

import requests
from bs4 import BeautifulSoup

r = requests.get(url)
html = r.text
soup = BeautifulSoup(html, "lxml")

Trying to locate certain links, I do:

exh = soup.find_all('a')

The output contains a couple of URLs in the usual format, but many of them have this form (one chosen at random):

exhibit103.htm

On the Firefox page, this entry looks like this:

[screenshot: the entry as rendered in Firefox]

Note that this entry does not appear clickable, but if you hover over it, it flashes the actual underlying link.

What I consider the relevant part of the html/css for this section looks like this:

<td>
  <div>
    <a style="-sec-extract:exhibit;" href="exhibit103.htm">
      <span>Amendment Two [etc.]</span>
    </a>
  </div>
</td>

It looks to my uninformed eyes like an href inside another href, i.e. nested links. So the general question is: why would anyone bother with this? The more important one (to me) is: how do I use BeautifulSoup (or any other method) to extract the actual link?

Jack Fleeting
  • Nested links are invalid markup, but the sample you've shown does not contain nested links; it's just a simple href (with a relative URL). If it's the relative path you're asking about, you can [convert that to an absolute URL](https://stackoverflow.com/questions/44001007/scrape-the-absolute-url-instead-of-a-relative-path-in-python) – Daniel Beck Apr 12 '19 at 18:05
  • Wow, they were relative links! For some reason, I didn't think of that possibility. Thanks mucho! – Jack Fleeting Apr 12 '19 at 19:09
  • No trouble, glad to help! – Daniel Beck Apr 12 '19 at 19:33
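Following the comment above, the conversion from a relative href to an absolute URL can be sketched with `urllib.parse.urljoin`. Note the base URL below is a hypothetical stand-in for the page that was scraped, and the HTML is the snippet from the question:

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical page URL for illustration; substitute the URL you requested.
page_url = "https://www.example.com/filings/index.htm"

# The snippet from the question, inlined so the sketch is self-contained.
html = """
<td>
  <div>
    <a style="-sec-extract:exhibit;" href="exhibit103.htm">
      <span>Amendment Two [etc.]</span>
    </a>
  </div>
</td>
"""

soup = BeautifulSoup(html, "lxml")

# href=True skips anchors that have no href attribute at all.
for a in soup.find_all("a", href=True):
    # urljoin resolves a relative href against the page's own URL.
    print(urljoin(page_url, a["href"]))
    # -> https://www.example.com/filings/exhibit103.htm
```

`urljoin` leaves already-absolute hrefs untouched, so it is safe to apply to every link on the page without first checking which kind each one is.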
