Extracting nested span / p / div structure using beautiful soup

Question

I am trying to extract this part from a page:

Using the browser's inspect view I see that:

Does the structure shown in the inspect view always match what bs4 returns?

I am using:

import json
import requests
from bs4 import BeautifulSoup

url = "https://docs.google.com/document/d/e/2PACX-1vSWVk1yd_I_zhVROYN2wv1r1y_54M-QL0199ZQ4g9mQZ7QdzekVzsRFUB_JVfkInwLxDNPmrwlY2x7y/pub?fbclid=IwAR0BsTNrbDeLb6j7tU2XhVxeh9WaQU_vELyDS3oNvem3eapiJ1zoBqZIYes"
soup = BeautifulSoup(requests.get(url).content, "html.parser")
# collect every <span> whose class is "c2"
data = soup.find_all("span", class_="c2")

But it returns:

[<span class="c2"></span>,
 <span class="c2"></span>,
 <span class="c2">Gyventojai, kuriems yra daugiau nei 65 metai (1.14 prioritetas)</span>,
 <span class="c2"></span>,
 <span class="c2">——————————————————————</span>,
 <span class="c2"></span>,
 <span class="c2"></span>,
 <span class="c2"></span>,
 <span class="c2">Švietimo sistemos darbuotojai bei abiturientai (1.15 prioritetas)</span>,
 <span class="c2">Diplomatai (1.16)</span>,
 <span class="c2">Sergantieji lėtinėmis ligomis (1.17)</span>,
 <span class="c2">Socialinių paslaugų teikėjai (1.18)</span>,
 <span class="c2">1.20 prioritetas: gyvybiškai svarbias valstybės funkcijas atliekantys asmenys, kontaktuojantys su kitais asmenimis (pareigūnai, prekybos įmonių salės darbuotojai ir kt.), išskyrus bendrųjų funkcijų darbuotojus. Šiuo metu šio prioriteto sąrašai nuolat keliami.</span>,
 <span class="c2">Gyventojų grupė 55-64 m.</span>,
 <span class="c2"></span>,
 <span class="c2">.</span>,
 <span class="c2"></span>,
 <span class="c2"></span>,
 <span class="c2"></span>]

Which does not include <p class="c6"><span class="c2">ŠIUO METU - TIK SENJORAI:</span></p>

I am unsure why, because the inspect view clearly shows class c2 on that span, the same class as the elements bs4 returns.

Should I always follow the nested structure with multiple find calls, or is there a better practice for getting the data I want?

Regarding the selection of certain elements by class name via bs4, take a look at [this post](https://stackoverflow.com/a/22284921/13044314) — Martin H., Apr 18 '21 at 12:44
@WiktorStribiżew, probably, because as Andrej said, the classes keep changing. — Jonas Palačionis, Apr 18 '21 at 16:27

score 1 · Accepted Answer · answered Apr 18 '21 at 16:08

The thing is, the CSS class name changes on every reload: sometimes it is c7, after a reload it is c1, and so on.

This example searches the page's CSS for the class name whose rule sets the red color (#cc0000, the color of your desired text) and then uses that class name to find the text:

import re
import requests
from bs4 import BeautifulSoup


url = "https://docs.google.com/document/d/e/2PACX-1vSWVk1yd_I_zhVROYN2wv1r1y_54M-QL0199ZQ4g9mQZ7QdzekVzsRFUB_JVfkInwLxDNPmrwlY2x7y/pub?fbclid=IwAR0BsTNrbDeLb6j7tU2XhVxeh9WaQU_vELyDS3oNvem3eapiJ1zoBqZIYes"

html_doc = requests.get(url).text

# find CSS class name that is red:
class_name = re.search(r"\.(c\d+)\{color:#cc0000;", html_doc).group(1)
soup = BeautifulSoup(html_doc, "html.parser")

print(soup.find(class_=class_name).text)

Prints:

ŠIUO METU - TIK SENJORAI:
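
If the color rule is ever missing from the page, re.search returns None and the .group(1) call above raises an AttributeError. Below is a minimal sketch of the same idea with that case guarded. Note that find_text_by_color and SAMPLE_HTML are hypothetical names made up for illustration, and the sample markup is only an assumed stand-in for the real document so the snippet runs without a network request:

```python
import re
from typing import Optional

from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page; the real markup may differ.
SAMPLE_HTML = """
<html><head><style>.c1{color:#000000;font-size:11pt}.c7{color:#cc0000;font-weight:700}</style></head>
<body>
<p class="c6"><span class="c7">ŠIUO METU - TIK SENJORAI:</span></p>
<p class="c6"><span class="c1">Gyventojų grupė 55-64 m.</span></p>
</body></html>
"""


def find_text_by_color(html_doc: str, color: str = "#cc0000") -> Optional[str]:
    """Return the text of the first element styled by a class whose rule starts with `color`."""
    # Look up the generated class name (c1, c7, ...) from the inline stylesheet.
    match = re.search(r"\.(c\d+)\{color:" + re.escape(color) + ";", html_doc)
    if match is None:
        # No class uses this color in the current version of the page.
        return None
    soup = BeautifulSoup(html_doc, "html.parser")
    element = soup.find(class_=match.group(1))
    return element.get_text(strip=True) if element is not None else None


print(find_text_by_color(SAMPLE_HTML))
print(find_text_by_color(SAMPLE_HTML, "#00ff00"))
```

With the real page you would pass requests.get(url).text instead of SAMPLE_HTML. This still assumes the stylesheet keeps the `.cN{color:...;` layout that the regex above relies on.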

@JonasPalačionis That I don't know (maybe Google keeps various versions of the document on its servers?) — Andrej Kesely, Apr 18 '21 at 16:25
