
I am trying to extract book names from the O'Reilly Media website using Python and Beautiful Soup.

However, I see that the book names are not in the page source HTML.

I am using this link to see the books:

https://www.oreilly.com/search/?query=*&extended_publisher_data=true&highlight=true&include_assessments=false&include_case_studies=true&include_courses=true&include_playlists=true&include_collections=true&include_notebooks=true&include_sandboxes=true&include_scenarios=true&is_academic_institution_account=false&source=user&formats=book&formats=article&formats=journal&sort=date_added&facet_json=true&json_facets=true&page=0&include_facets=true&include_practice_exams=true

Attached is a screenshot showing the webpage with the first two books, alongside the Chrome developer tools with arrows pointing to the elements I'd like to extract.

[Screenshot: oreilly search results]

I looked at the page source but could not find the book names; maybe they are loaded from some other links inside the main HTML.

I tried opening some of the links inside the HTML and searched for the book names, but could not find anything.

Is it possible to extract the first or second book's name from the website using Beautiful Soup? If not, is there any other Python package that can do that? Maybe Selenium?

Or, as a last resort, any other tool...

Rafael Zanzoori
  • Your browser executes JavaScript, which can load additional content and modify previously loaded content. What you're looking at with the Developer Tools is the resulting document model. What you're looking at when you view the source is the source of the unmodified page as loaded. You'll need a solution that executes the JavaScript for you after loading, and `selenium`, which you mentioned, is such a solution. – Grismar Feb 27 '22 at 22:12
  • If the page uses JavaScript to add items, then you can check in `DevTools` (tab `Network`) whether it reads data from some URL, and then try to use `requests` with that URL to get the data. JavaScript usually gets data as JSON, which can be converted directly to a Python dictionary, so it doesn't need Beautiful Soup. – furas Feb 27 '22 at 22:30

2 Answers


So if you investigate the Network tab while the page loads, you can see that the browser sends a request to an API.

[Screenshot: sent request]

It returns JSON with the books.

After some investigation, you can get the titles via:

import json
import requests

# The search API endpoint found in the Network tab (note the /api/v2/search/ path
# and the extra orm-service=search-frontend parameter)
response_json = json.loads(requests.get(
    "https://www.oreilly.com/api/v2/search/?query=*&extended_publisher_data=true&highlight=true&include_assessments=false&include_case_studies=true&include_courses=true&include_playlists=true&include_collections=true&include_notebooks=true&include_sandboxes=true&include_scenarios=true&is_academic_institution_account=false&source=user&formats=book&formats=article&formats=journal&sort=date_added&facet_json=true&json_facets=true&page=0&include_facets=true&include_practice_exams=true&orm-service=search-frontend").text)

# Each search result stores its highlighted title as a list of strings
for book in response_json['results']:
    print(book['highlights']['title'][0])
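The `page=0` parameter in the query string suggests the API is paginated, so incrementing it should yield further pages of results. A minimal sketch of that idea (the trimmed-down query string and the empty-page stopping condition are assumptions, not verified against the API):

import requests

# Assumption: the omitted query parameters are optional
BASE_URL = ("https://www.oreilly.com/api/v2/search/?query=*&formats=book"
            "&sort=date_added&orm-service=search-frontend&page={page}")

titles = []
for page in range(5):  # first five pages; adjust as needed
    results = requests.get(BASE_URL.format(page=page)).json().get("results", [])
    if not results:  # assumption: an empty result list means no more pages
        break
    titles.extend(book["highlights"]["title"][0] for book in results)

print(len(titles), titles[:3])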
sudden_appearance
  • you don't have to convert `requests.get(...).text` with `json.loads` because you can get the JSON directly with `requests.get(...).json()` – furas Feb 27 '22 at 22:35
  • Thank you. This is pretty much 99% of the solution. When I copy-paste your link into the browser I can see all the information I need, and even more than what the website exposes. However, when running the code I get the following error: SSLError: HTTPSConnectionPool(host='www.oreilly.com', port=443): Max retries exceeded with url: ... (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:997)'))). Could you please help with that? Or maybe this should be a separate question... – Rafael Zanzoori Feb 27 '22 at 23:16
  • @RafaelZanzoori, I only can refer to [this](https://stackoverflow.com/questions/10667960/python-requests-throwing-sslerror) as a probable solution. – sudden_appearance Feb 28 '22 at 09:59
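Putting the two comments above together: `.json()` parses the response body directly, and the `SSLCertVerificationError` is usually a local certificate-store problem rather than anything wrong with the request. A sketch of the last-resort workaround from the linked question (disabling verification is insecure; repairing the local CA bundle is the proper fix):

import requests

url = ("https://www.oreilly.com/api/v2/search/?query=*&formats=book"
       "&sort=date_added&orm-service=search-frontend&page=0")

# Last resort: verify=False skips certificate verification entirely, which is
# insecure (open to man-in-the-middle). Prefer fixing the local certificate
# store (e.g. running "Install Certificates" on macOS) if possible.
response_json = requests.get(url, verify=False).json()

for book in response_json["results"]:
    print(book["highlights"]["title"][0])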

To solve this issue you need to know that Beautiful Soup can only deal with plain HTML. On websites that build the page with JavaScript, Beautiful Soup can't get all the page data you are looking for, because you need something browser-like to load the JavaScript data on the page. That is where Selenium comes in: it opens a browser page and loads all of the page's data. You can combine the two like this:

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

# This will make Selenium run in the background (headless, no visible window)
chrome_options = Options()
chrome_options.add_argument("--headless")

# You need to install the ChromeDriver that matches your Chrome version
driver = webdriver.Chrome('#Dir of the driver', options=chrome_options)
driver.get('#url')

# page_source contains the HTML after the JavaScript has run
html = driver.page_source
soup = BeautifulSoup(html, 'lxml')  # lxml must be installed, but needn't be imported

With this you can get all the data that you need. Don't forget to write this at the end to quit the Selenium browser running in the background:

driver.quit()
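Once the rendered HTML is in `soup`, the book names can be pulled out with a CSS selector before calling `driver.quit()`. The selector below is a placeholder, since O'Reilly's markup (and its generated class names) can change; inspect the live page in DevTools, as in the question's screenshot, to find the current one:

# Hypothetical selector -- replace it with the element you see in DevTools
for title_link in soup.select("article h3 a"):
    print(title_link.get_text(strip=True))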