import requests
from bs4 import BeautifulSoup

r = requests.get('https://ca.finance.yahoo.com/quote/AMZN/profile?p=AMZN')
soup = BeautifulSoup(r.content, 'html.parser')
price = soup.find_all('div', {'class':'My(6px) Pos(r) smartphone_Mt(6px)'})
print(price)

I am new to BeautifulSoup, and I am confused as to why this returns:

[]

Have I made an error in my code, or does BeautifulSoup not pick up the website's code? Also, whenever I try something like 'xml' or 'lxml' instead of 'html.parser', it gives me an error like this:

bs4.FeatureNotFound: Couldn't find a tree builder with the features you requested: xml. Do you need to install a parser library?
Jack Jones
  • I don't know Beautiful Soup well, but I am unable to find that class in the fetched HTML, even though I can see it in the page source when viewed from the browser. Searching by data-reactid="29", I also cannot find the same div. – xeon zolt May 26 '20 at 05:54
  • I can locate the code in the web source but I fail to locate it in my actual code for some reason. – Jack Jones May 26 '20 at 06:12
  • It does seem like the source in beautifulsoup is different from the source on a browser. Is there a reason you're searching for the class 'My(6px) Pos(r) smartphone_Mt(6px)'? It seems that it's a parent div to other divs that contain information to scrape. – Desmond Cheong May 26 '20 at 06:31
  • 2
    It looks like the content you are looking for is JS generated. This won't work with requests. You ought to have a look at Selenium, which can work with an actual browser. – S.D. May 26 '20 at 06:31
  • @DesmondCheong I am searching that because I tried making my search more accurate but it didn't work. So I tried searching for the outer div which had that class. – Jack Jones May 26 '20 at 06:42
  • @S.D. I love Selenium, but the issue is that I have only used it for web automation with the Chrome driver. I do not know how to scrape with it without opening up a browser. – Jack Jones May 26 '20 at 06:44
  • @JackJones you can use headless mode if you do not want to open browser – xeon zolt May 26 '20 at 06:48
  • what is headless mode? I have never heard of this. – Jack Jones May 26 '20 at 06:51
  • in the headless mode, you can use selenium but no browser UI will be launched – xeon zolt May 26 '20 at 07:06

3 Answers


The data is stored in a JavaScript variable embedded in the page source. You can use the re and json modules to extract the information.

For example:

import re
import json
import requests

url = 'https://ca.finance.yahoo.com/quote/AMZN/profile?p=AMZN'

html_data = requests.get(url).text

data = json.loads(re.search(r'root\.App\.main = ({.*?});\n', html_data).group(1))

# uncomment this to print all data:
# print(json.dumps(data, indent=4))

price = data['context']['dispatcher']['stores']['QuoteSummaryStore']['price']['regularMarketPrice']['fmt']
currency_symbol = data['context']['dispatcher']['stores']['QuoteSummaryStore']['price']['currencySymbol']

print('{} {}'.format(price, currency_symbol))

Prints:

2,436.88 $
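Note that the nested key path above depends on Yahoo's internal page structure, which can change without notice, and a missing key anywhere along the chain raises a KeyError. A defensive lookup helper (a sketch; `deep_get` is a hypothetical name, and the stub dict below only imitates the structure the answer extracts from) returns a default instead of crashing:

```python
from functools import reduce

def deep_get(data, path, default=None):
    """Walk a nested dict along `path`, returning `default` if any step fails."""
    try:
        return reduce(lambda d, key: d[key], path, data)
    except (KeyError, TypeError):
        return default

# Stub imitating the QuoteSummaryStore structure from the answer above
stub = {'context': {'dispatcher': {'stores': {'QuoteSummaryStore': {
    'price': {'regularMarketPrice': {'fmt': '2,436.88'}}}}}}}

path = ['context', 'dispatcher', 'stores', 'QuoteSummaryStore',
        'price', 'regularMarketPrice', 'fmt']

print(deep_get(stub, path))                           # 2,436.88
print(deep_get(stub, ['context', 'missing'], 'n/a'))  # n/a
```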
Andrej Kesely

As suggested by @S.D. and @xeon zolt, the issue seems to be that the content you're searching for is generated by scripts. In order for Beautiful Soup to parse this, we have to load the web page with a browser then pass the page source to Beautiful Soup.

From your comment I assume you already have Selenium set up. You can load the page in Selenium, then pass the page source to Beautiful Soup, like so:

from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Firefox()

driver.get("https://ca.finance.yahoo.com/quote/AMZN/profile?p=AMZN")

# give the page's scripts up to 5 seconds to populate the DOM
driver.implicitly_wait(5)

page_source = driver.page_source

driver.close()

soup = BeautifulSoup(page_source, 'html.parser')

Additionally, headless mode simply means that no visible browser UI (such as a window opening and then closing) appears when you run the script. You can use headless mode by modifying the code to include the following:

from selenium.webdriver.firefox.options import Options

options = Options()
options.headless = True

driver = webdriver.Firefox(options=options)

To answer your final question: before using a new parser, you have to install it. For example, to use the lxml parser, first run this in your command line:

$ pip install lxml
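Once lxml is installed, a quick check (a minimal sketch) confirms that Beautiful Soup can find it and no longer raises bs4.FeatureNotFound:

```python
from bs4 import BeautifulSoup

# If lxml is installed, this succeeds instead of raising bs4.FeatureNotFound
soup = BeautifulSoup("<html><body><p>ok</p></body></html>", "lxml")
print(soup.p.text)  # ok
```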

Hope this helps!

Desmond Cheong
  • I see! okay so basically i can use all of selenium in headless mode but I just need to write the code you posted so that i am in headless mode and then just do everything normally? Also is it better just to not pass the page source from selenium to beautifulsoup and just continue web scrapping in selenium itself? – Jack Jones May 26 '20 at 08:26
  • Yup, you just need to pass the headless option to the Selenium driver. If you're comfortable with parsing html in Selenium you can definitely continue web scraping in Selenium itself. It might be better in the sense that you don't have to work with an additional library. Although, personally I find the Beautiful Soup interface quite pleasant to work with. – Desmond Cheong May 26 '20 at 09:01
  • Just saw @Andrej Kesely's answer too. That's another very good route to explore because it takes the content straight from the javascript included in the page. This removes the need for Selenium which can be a little heavy-handed. – Desmond Cheong May 26 '20 at 09:06

You can switch over to the lxml parser by running this command in Terminal or Command Prompt:

pip install lxml

Then try this out:

soup = BeautifulSoup(html, "lxml")

See the Beautiful Soup documentation for more information on installing and choosing a parser.

Osadhi Virochana