
I've asked this question before to no avail. I am trying to figure out how to use bs4 to grab the download links from within the website's source. The problem I can't figure out is that the links sit inside a dynamic content library. (I've removed the previous HTML snippet; see the examples below.)

We've been able to grab the links with this script, but only after manually grabbing the source code from the website:

import re

# the output file name is just what we use; SourceCode.txt is the page
# source we paste in by hand (see below)
result = open('DownloadLinks.html', 'w')
result.write('<html>\n<body>\n')

link_count = 0
for line in open('SourceCode.txt'):
    line = line.rstrip()
    # grab everything between href=" and tif
    x = re.findall(r'href=[\'"]?([^\'" >]+)tif', line)
    if len(x) > 0:
        link_count += len(x)
        result.write('<a href="' + 'tif">link</a><br>\n<a href="'.join(x) + 'tif">link</a><br>\n')

result.write('There are ' + str(link_count) + ' links\n')
result.write('</body>\n</html>\n')
result.close()

print "Download HTML page created."

But only after going to the website, View Source, Ctrl+A, copy, and pasting it into SourceCode.txt. I would like to remove the manual labor from all of this.
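For what it's worth, a minimal sketch of what we imagine the automated version would look like (assuming Python 2 and urllib2, as used elsewhere in this thread, and a made-up URL) simply fetches the source into SourceCode.txt instead of copy-pasting it. Of course this only helps if the links actually appear in the raw source:

from urllib2 import urlopen

# hypothetical URL; replace with the real library page
url = "http://www.website.com/page.html"
html = urlopen(url).read()

# write the fetched source where the regex script already expects it
with open('SourceCode.txt', 'w') as f:
    f.write(html)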

I'd greatly appreciate any information/tips/advice!

EDIT

I wanted to add some more information regarding the website we are using: the library content only shows up after it has been manually expanded. Otherwise the content (i.e., the download links / *.tif hrefs) is not visible. Here's an example of what we see:

Source code of the site without opening the library element:

<html><body>

Source code after opening the library element:

<html><body>
<h3>Library</h3>
<div id="libraryModalBody">

    <div><table><tbody>

    <tr>
    <td>Tile12</td>
    <td><a href="http://www.website.com/path/Tile12.zip">Button</a></td>
    </tr>

    </tbody></table></div>

</div> 

Source code after expanding all download options:

<html><body>
<h3>Library</h3>
<div id="libraryModalBody">
    <div><table><tbody>
    <tr>
    <td>Tile12</td>
    <td><a href="http://www.website.com/path/Tile12.zip">Button</a></td>
    </tr>
    <tr>
    <td>Tile12_Set1.tif</td>
    <td><a href="http://www.website.com/path/Tile12_Set1.tif">Button</a></td>
    </tr>
    <tr>
    <td>Tile12_Set2.tif</td>
    <td><a href="http://www.website.com/path/Tile12_Set2.tif">Button</a></td>
    </tr>
    </tbody></table></div>
</div>

Our end goal is to grab the download links with only the website URL as input. The issue seems to be the way the content is displayed (i.e., dynamic content only visible after manual expansion of the library).
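One idea we are considering (just a sketch, not something we have working) is to drive a real browser with Selenium, expand the library, and then hand the rendered source to bs4. The URL and the CSS selector for the expand control below are pure guesses and would need to match the real page:

import time
from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Firefox()  # assumes a local Firefox is available
driver.get("http://www.website.com/page.html")  # hypothetical URL

# guessed selector for the "expand library" control
driver.find_element_by_css_selector("#libraryModalBody .expand-all").click()
time.sleep(2)  # crude wait for the extra rows to render

soup = BeautifulSoup(driver.page_source, "html.parser")
tif_urls = [a["href"] for a in soup.find_all("a", href=True)
            if a["href"].endswith(".tif")]
driver.quit()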

D.V
2 Answers


Do not try to parse HTML with regular expressions; it cannot be done reliably. Use BeautifulSoup4 instead:

from urllib2 import urlopen
from bs4 import BeautifulSoup

url = "http://www.your-server.com/page.html"
document = urlopen(url)
soup = BeautifulSoup(document, "html.parser")  # specify the parser explicitly

# look for all URLs:
found_urls = [link["href"] for link in soup.find_all("a", href=True)]

# look only for URLs to *.tif files:
found_tif_urls = [link["href"]
                  for link in soup.find_all("a", href=True)
                  if link["href"].endswith(".tif")]
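To tie this back to the download page your regex script was producing, the found URLs could then be written out to a simple HTML file (the file name here is just an example):

# write the .tif links into a simple download page
with open("DownloadLinks.html", "w") as result:
    result.write("<html>\n<body>\n")
    for tif_url in found_tif_urls:
        result.write('<a href="%s">link</a><br>\n' % tif_url)
    result.write("There are %d links\n</body>\n</html>\n" % len(found_tif_urls))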
geckon
  • Thanks geckon, a question on this. I might be understanding your script incorrectly, but this would still rely on manually retrieving the HTML source from the website. If that's the case, it doesn't remove the manual labor I have been doing already. – D.V Nov 24 '15 at 15:27
  • That makes sense, thanks for the information. Oddly enough, the regex has been working for what I've been doing. That being said, I'll be using what you wrote. Thanks – D.V Nov 24 '15 at 15:37
  • @D.V Regular expressions may work for some examples of HTML but there are HTML sources that wouldn't work with them. Is my answer solving your problem now? – geckon Nov 24 '15 at 15:41
  • Your answer definitely shed some light on a possible flaw in the regex we are using. I'll give your example a try and post the outcome. I really appreciate the prompt answers! – D.V Nov 24 '15 at 15:51
  • I've been playing with your example; I had to fix an issue with my bs4 import (it was giving me an issue with the HTML parser). In either case, thanks for the help. – D.V Nov 24 '15 at 17:50
  • @D.V Yeah, sorry I had the import wrong, I fixed it now. – geckon Nov 25 '15 at 09:02

You may as well take a look at the PyQuery library, which uses a (sub)set of the CSS selectors from jQuery:

from pyquery import PyQuery

pq = PyQuery(body)  # body is the HTML source of the page
pq('div.content div#filter-container div.filter-section')
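For example, a sketch adapted to the markup shown in the question (PyQuery can fetch the page itself; the URL is hypothetical):

from pyquery import PyQuery

doc = PyQuery(url="http://www.website.com/page.html")

# pick out the .tif links inside the library div from the question
tif_urls = [a.attr("href")
            for a in doc("#libraryModalBody a").items()
            if a.attr("href") and a.attr("href").endswith(".tif")]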
Ojomio