I've asked this question before to no avail. I am trying to figure out how to implement bs4 to grab the links to be used for downloading from within the website's source. The problem I can't figure out is the links are within a dynamic content library. I've removed previous html snippet, look below
We've been able to grab the links with this script only after manually grabbing the source code from the website:
import re
enter code here
line = line.rstrip()
x = re.findall('href=[\'"]?([^\'" >]+)tif', line)
if len(x) > 0 :
result.write('tif">link</a><br>\n<a href="'.join(x))
`result.write('tif">link</a><br>\n\n</html>\n</body>\n')
result.write("There are " + len(x) + " links")
print "Download HTML page created."
But only after going into the website ctrl + a -> view source -> select all & copy -> paste onto SourceCode.txt. I would like to remove the manual labor from all this.
I'd greatly appreciate any information/tips/advice!
EDIT
I wanted to add some more information regarding the website we are using, the Library content will only show up when it has been manually expanded. Otherwise, the content (i.e., the download links/href *.tif) are not visible. Here's an example of what we see:
Source Code of site without opening the library element.
<html><body>
Source Code after opening library element.
<html><body>
<h3>Library</h3>
<div id="libraryModalBody">
<div><table><tbody>
<tr>
<td>Tile12</td>
<td><a href="http://www.website.com/path/Tile12.zip">Button</a></td>
</tr>
</tbody></table></div>
</div>
Source code after expanding all download options.
<html><body>
<h3>Library</h3>
<div id="libraryModalBody">
<div><table><tbody>
<tr>
<td>Tile12</td>
<td><a href="http://www.website.com/path/Tile12.zip">Button</a></td>
</tr>
<tr>
<td>Tile12_Set1.tif</td>
<td><a href="http://www.website.com/path/Tile12_Set1.tif">Button</a></td>
</tr>
<tr>
<td>Tile12_Set2.tif</td>
<td><a href="http://www.website.com/path/Tile12_Set2.tif">Button</a></td>
</tr>
</tbody></table></div>
</div>
Our end goal would be to grab the downloads link with only having to input the website url. The issue seems to be in the way the content is displayed (i.e., dynamic content only visible after manual expansion of the library.