I have noticed that if I request a URL using

urllib.request.urlopen([my_url]).read()

I get something like this:

<html>
<head>
</head>
<body>
    <span>...</span>
</body>
<script>
</script>
</html>

All the info I want to parse with BeautifulSoup is in that <span>...</span> section. If I use a webdriver, that section is included, but a webdriver takes longer and makes my code messier. Is there a way to retrieve the entire HTML document without using a webdriver?

Mwspencer

2 Answers

Here is a simpler, easier-to-read solution for parsing the contents of the <span> tag:

from bs4 import BeautifulSoup as soup
from urllib.request import urlopen as uReq

my_url = 'https://www.foo.com'

# Download the raw HTML; the with-block closes the connection afterwards
with uReq(my_url) as uClient:
    page_html = uClient.read()

page_soup = soup(page_html, "html.parser")
# find_all returns a list of matching tags, so print the text of each one
span_content = page_soup.find_all("span", {"<attribute_name>": "<attribute_value>"})
for span in span_content:
    print(span.text)
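
If the search comes back empty, it can help to look at what the raw response actually contains inside its <span> tags. The following is a minimal sketch using only the standard library, not part of the original answer. `SpanTextCollector`, `span_texts`, and `sample_html` are hypothetical names, and the sample string is an assumption standing in for the page that was downloaded.

```python
from html.parser import HTMLParser


class SpanTextCollector(HTMLParser):
    """Collects the text found inside every top-level <span> element."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # number of <span> tags currently open
        self.spans = []   # accumulated text, one entry per top-level <span>

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            if self.depth == 0:
                self.spans.append("")  # start a new entry
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "span" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        # Only keep text that appears while a <span> is open
        if self.depth > 0:
            self.spans[-1] += data


def span_texts(html):
    """Return the stripped text of each top-level <span> in an HTML string."""
    collector = SpanTextCollector()
    collector.feed(html)
    collector.close()
    return [text.strip() for text in collector.spans]


# Hypothetical sample standing in for the decoded server response
sample_html = "<html><body><span></span><span>hello</span></body></html>"
print(span_texts(sample_html))
```

With a real page, the bytes from `read()` need decoding first, for example `span_texts(page_html.decode("utf-8"))`, assuming the page is UTF-8. An empty string in the result means the server sent that <span> without any content, which would be consistent with the content being filled in later by a script.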
undetected Selenium
  • This still only gives me .... If I add an attribute, my findAll list is [ ]. The webdriver seems to be the only way I can find to gather all the contents of the HTML. If you'd like to see an example of the different methods, I've added the script here: https://github.com/mws75/UserName_by_Tag/blob/master/HashTag_SE_Test.py – Mwspencer Dec 14 '17 at 18:40
  • Hello DebanjanB, I apologize, I didn't have time to test this till now, but your method works great. It's fast, and gets the info I need. I haven't figured out how to load more of the page though, so that's my next step. But if I can figure that out, my webscraper will be much faster than using Selenium. Thanks for your help. – Mwspencer Jan 24 '18 at 16:40

You can use the popular requests library. See if the code below helps:

import requests
from bs4 import BeautifulSoup

page = requests.get('https://www.google.com/')
soup = BeautifulSoup(page.text, 'lxml')

span = soup.find_all('span')
print(span)
Satish
  • I'm still only getting .... Feel free to test the different methods. I've posted the code here: https://github.com/mws75/UserName_by_Tag/blob/master/HashTag_SE_Test.py – Mwspencer Dec 14 '17 at 18:41