Retrieve entire HTML with urlopen(url)

Question

I have noticed that if I request a URL using

urllib.request.urlopen([my_url]).read()

I get something like this:

<html>
<head>
</head>
<body>
    <span>...</span>
</body>
<script>
</script>


</html>

All the info I want for BeautifulSoup is in that <span>...</span> section. If I use a webdriver, that section is included, but a webdriver seems to take longer and makes my code messier. Is there a way to retrieve the entire HTML document without using a webdriver?

undetected Selenium · Accepted Answer · 2017-12-15T07:47:06.817

2

Here is a simpler, easier-to-read way to parse the contents of the <span> tag:

from bs4 import BeautifulSoup as soup
from urllib.request import urlopen as uReq

my_url = 'https://www.foo.com'

# Download the raw HTML; the with block closes the connection afterwards
with uReq(my_url) as uClient:
    page_html = uClient.read()

page_soup = soup(page_html, "html.parser")
# find_all returns a list of tags, so read .text from each tag in turn
span_content = page_soup.find_all("span", {"<attribute_name>": "<attribute_value>"})
for span in span_content:
    print(span.text)
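
One detail worth spelling out: find_all (findAll is the older spelling) returns a list of tags, not a single tag, so the text has to be read from each element. Below is a small offline sketch of that, not taken from the original answer. The HTML string and the class name "username" are made up for illustration, so it runs without touching any real site:

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for a downloaded page (not from the real site).
sample_html = """
<html><body>
  <span class="username">alice</span>
  <span class="username">bob</span>
  <span class="other">ignored</span>
</body></html>
"""

page_soup = BeautifulSoup(sample_html, "html.parser")

# find_all returns a list-like ResultSet, so read the text tag by tag.
matches = page_soup.find_all("span", {"class": "username"})
names = [tag.get_text(strip=True) for tag in matches]

# find returns only the first matching tag, or None when nothing matches.
first = page_soup.find("span", {"class": "username"})

print(names)
print(first.get_text(strip=True) if first else None)
```

If the attribute filter matches nothing, find_all gives back an empty list and find gives back None, so it is worth checking for both before reading any text.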

edited Dec 15 '17 at 07:47

answered Dec 07 '17 at 08:30

undetected Selenium


This still only gives me .... If I add an attribute, my findAll list is []. The webdriver seems to be the only way I can find to gather all the contents of the HTML. If you'd like to see an example of the different methods, I've added the script here: https://github.com/mws75/UserName_by_Tag/blob/master/HashTag_SE_Test.py – Mwspencer Dec 14 '17 at 18:40

Hello DebanjanB, I apologize, I didn't have time to test this till now, but your method works great. It's fast, and gets the info I need. I haven't figured out how to load more of the page though, so that's my next step. But if I can figure that out, my webscraper will be much faster than using Selenium. Thanks for your help. – Mwspencer Jan 24 '18 at 16:40

score 1 · Answer 2 · answered Dec 06 '17 at 19:00


You can use the popular requests library. See if the code below helps:

import requests
from bs4 import BeautifulSoup

page = requests.get('https://www.google.com/')
soup = BeautifulSoup(page.text, 'lxml')

span = soup.find_all('span')
print(span)
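
If the span still comes back empty, it can help to check what the server actually sent before blaming the parser. Neither urlopen nor requests runs JavaScript, so anything a page adds with a script after loading will not be in the response. The thread does not confirm that this is the cause here, so treat it as an assumption. A minimal standard-library sketch of that check follows; the two HTML strings and the helper name span_texts are hypothetical:

```python
from html.parser import HTMLParser


class SpanTextCollector(HTMLParser):
    """Collects the text found inside every top-level <span> element."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # how many <span> tags are currently open
        self.spans = []   # one text entry per top-level <span>

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            if self.depth == 0:
                self.spans.append("")
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "span" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.spans[-1] += data


def span_texts(html):
    """Return the stripped text of each top-level <span> in html."""
    collector = SpanTextCollector()
    collector.feed(html)
    collector.close()
    return [text.strip() for text in collector.spans]


# Hypothetical markup: one page as served, one after scripts have run.
raw_html = "<html><body><span></span><script></script></body></html>"
rendered_html = "<html><body><span>user_one</span></body></html>"

print(span_texts(raw_html))
print(span_texts(rendered_html))
```

An empty string for a span in the raw response means the text was never in the downloaded markup, and in that case no choice of parser will recover it.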


Satish


I'm still only getting .... Feel free to test the different methods. I've posted the code here: https://github.com/mws75/UserName_by_Tag/blob/master/HashTag_SE_Test.py – Mwspencer Dec 14 '17 at 18:41
