How do you find inside a div element with bs4?

Question

I'm writing a Python script to list the top 5 featured projects on scratch.mit.edu. I use requests to fetch the page. The project titles sit inside a div tag, but when I parse the page with bs4, that div shows no children or descendants. How can I look inside the tag?

I've tried find_all(), find(), .descendants, and .children.

soup.find("div").children

I expected to get the contents of <div id="page">.

Umm... Can you give a concrete example of the input and the desired output? — 0xInfection, May 12 '19 at 03:00
Get it with page = soup.find("div", {"id": "page"}). Now page is a bs4 Tag holding the div's content. — Mohammad Etemaddar, May 12 '19 at 08:39
Possible duplicate of [Beautiful Soup and extracting a div and its contents by ID](https://stackoverflow.com/questions/2136267/beautiful-soup-and-extracting-a-div-and-its-contents-by-id) — Mohammad Etemaddar, May 12 '19 at 08:40

QHarr · Accepted Answer · 2019-05-12T08:32:25.080

API

Use the API that the page itself calls to update its content, and parse the JSON response:

https://api.scratch.mit.edu/proxy/featured

import requests
import pandas as pd

# Fetch the featured-projects data as JSON instead of scraping the HTML
r = requests.get('https://api.scratch.mit.edu/proxy/featured').json()
# Keep the title and project link of the first 5 community featured projects
project_info = [(item['title'], 'https://scratch.mit.edu/projects/' + str(item['id'])) for item in r['community_featured_projects'][:5]]
df = pd.DataFrame(project_info, columns=['Title', 'Link'])
print(df.head())
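If you don't need pandas, the same extraction can be sketched with plain lists. This is only an illustration: the sample payload below is made up (hypothetical ids and titles) and merely mirrors the keys the code above indexes into (`community_featured_projects`, `title`, `id`); the helper name `top_projects` is also my own.

```python
import json

# Hypothetical sample shaped like the JSON the code above reads.
# The ids and titles are invented for illustration only.
SAMPLE_RESPONSE = json.loads("""
{
  "community_featured_projects": [
    {"id": 101, "title": "Maze Game"},
    {"id": 102, "title": "Pixel Art"},
    {"id": 103, "title": "Platformer"},
    {"id": 104, "title": "Music Maker"},
    {"id": 105, "title": "Quiz"},
    {"id": 106, "title": "Animation"}
  ]
}
""")

BASE_URL = "https://scratch.mit.edu/projects/"


def top_projects(payload, n=5):
    """Return (title, link) pairs for the first n featured projects."""
    # Fall back to an empty list if the key is missing from the payload.
    items = payload.get("community_featured_projects", [])
    return [(item["title"], BASE_URL + str(item["id"])) for item in items[:n]]


projects = top_projects(SAMPLE_RESPONSE)
for title, link in projects:
    print(title, link)
```

With the live API you would pass the result of `requests.get(...).json()` to `top_projects` instead of the sample.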

Selenium

Alternatively, because the content is rendered dynamically, you could drive a real browser with Selenium, although this is the less efficient option.

Restrict the match to the first "box", select the child a tags of the elements with the thumbnail-title class, then index into the list for the top 5 (or use df.head()):

.box:nth-of-type(1) .thumbnail-title > a

Python (as noted by @P.hunter, you can run this headless):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.chrome.options import Options
import pandas as pd

options = Options()
options.add_argument("--headless")  # run Chrome without opening a window

d = webdriver.Chrome(options=options)
d.get('https://scratch.mit.edu/')
# Wait up to 10 seconds for the JavaScript-rendered links to be present
links = WebDriverWait(d, 10).until(
    EC.presence_of_all_elements_located(
        (By.CSS_SELECTOR, ".box:nth-of-type(1) .thumbnail-title > a")
    )
)
project_info = [(item.get_attribute('title'), item.get_attribute('href')) for item in links]
df = pd.DataFrame(project_info, columns=['Title', 'Link'])
d.quit()
print(df)
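To illustrate why bs4 on its own sees an empty div, and how the selector behaves once the content exists, here is a small offline sketch. Both HTML strings are hand-written assumptions (not copied from the real site): one stands in for what a plain HTTP fetch might return before JavaScript runs, the other for the rendered page. The helper name `featured_links` is hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: roughly what requests might receive before any
# JavaScript has filled the page in.
STATIC_HTML = '<html><body><div id="page"></div></body></html>'

# Hypothetical markup: a simplified stand-in for the rendered page.
RENDERED_HTML = """
<div id="page">
  <div class="box">
    <div class="thumbnail-title"><a href="/projects/1/" title="First">First</a></div>
    <div class="thumbnail-title"><a href="/projects/2/" title="Second">Second</a></div>
  </div>
  <div class="box">
    <div class="thumbnail-title"><a href="/projects/3/" title="Other">Other</a></div>
  </div>
</div>
"""

# Same CSS selector as in the answer above.
SELECTOR = ".box:nth-of-type(1) .thumbnail-title > a"


def featured_links(html):
    """Return (title, href) pairs matched by SELECTOR in the given HTML."""
    soup = BeautifulSoup(html, "html.parser")
    return [(a.get("title"), a.get("href")) for a in soup.select(SELECTOR)]


# The unrendered div has nothing inside it for bs4 to find.
page = BeautifulSoup(STATIC_HTML, "html.parser").find("div", {"id": "page"})
static_children = list(page.children)
static_links = featured_links(STATIC_HTML)

# Once the content is present, only links in the first box are matched.
rendered_links = featured_links(RENDERED_HTML)
print(static_children, static_links, rendered_links)
```

In the Selenium version, the rendered markup could be handed to bs4 through the driver's `page_source` attribute if you prefer parsing with BeautifulSoup.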

Running it headless would be better, right? ```options.add_argument('--headless') options.add_argument('--disable-gpu') # Last I checked this was necessary. driver = webdriver.Chrome(CHROMEDRIVER_PATH, chrome_options=options) ``` — P.hunter, May 12 '19 at 05:34
@P.hunter What is the advantage of options.add_argument('--disable-gpu')? Removing unnecessary overhead? — QHarr, May 12 '19 at 05:41
It was related to an issue on Windows and was a temporary flag back then, meant to be removed in later versions. I did some research and found that `--disable-gpu` is no longer needed in the new version; you can just do ```options = Options() options.headless = True driver = webdriver.Chrome(options=options, executable_path=r'C:\path\to\chromedriver.exe') ``` Please update the answer :) — P.hunter, May 12 '19 at 08:31
