
I am trying to extract one piece of text from a page: the heading "ŠIUO METU - TIK SENJORAI:".

In the browser's inspect view, that text sits in a <span class="c2"> nested inside a <p class="c6">.

Does the structure shown in the inspect view always match what bs4 returns?

I am using:

import requests
from bs4 import BeautifulSoup

url = "https://docs.google.com/document/d/e/2PACX-1vSWVk1yd_I_zhVROYN2wv1r1y_54M-QL0199ZQ4g9mQZ7QdzekVzsRFUB_JVfkInwLxDNPmrwlY2x7y/pub?fbclid=IwAR0BsTNrbDeLb6j7tU2XhVxeh9WaQU_vELyDS3oNvem3eapiJ1zoBqZIYes"
soup = BeautifulSoup(requests.get(url).content, "html.parser")
# collect every <span> whose class is "c2"
data = soup.find_all("span", class_="c2")

But it returns:

[<span class="c2"></span>,
 <span class="c2"></span>,
 <span class="c2">Gyventojai, kuriems yra daugiau nei 65 metai (1.14 prioritetas)</span>,
 <span class="c2"></span>,
 <span class="c2">——————————————————————</span>,
 <span class="c2"></span>,
 <span class="c2"></span>,
 <span class="c2"></span>,
 <span class="c2">Švietimo sistemos darbuotojai bei abiturientai (1.15 prioritetas)</span>,
 <span class="c2">Diplomatai (1.16)</span>,
 <span class="c2">Sergantieji lėtinėmis ligomis (1.17)</span>,
 <span class="c2">Socialinių paslaugų teikėjai (1.18)</span>,
 <span class="c2">1.20 prioritetas: gyvybiškai svarbias valstybės funkcijas atliekantys asmenys, kontaktuojantys su kitais asmenimis (pareigūnai, prekybos įmonių salės darbuotojai ir kt.), išskyrus bendrųjų funkcijų darbuotojus. Šiuo metu šio prioriteto sąrašai nuolat keliami.</span>,
 <span class="c2">Gyventojų grupė 55-64 m.</span>,
 <span class="c2"></span>,
 <span class="c2">.</span>,
 <span class="c2"></span>,
 <span class="c2"></span>,
 <span class="c2"></span>]

Which does not include <p class="c6"><span class="c2">ŠIUO METU - TIK SENJORAI:</span></p>

I am unsure why, because the inspect view shows class c2 on that span, and the spans that bs4 does return also have class c2.

Should I always walk the nested structure with multiple find calls, or is there a better practice for getting the data I want?

Jonas Palačionis

1 Answer


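The problem is that the CSS class names change on every reload: the same element may be c7 on one load and c1 on the next. So the class you see in the inspect view is not necessarily the one in the HTML that requests downloads.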

This example searches the page's stylesheet for the CSS class whose color is red (#cc0000, the color of your desired text) and then uses that class name to find the text:

import re
import requests
from bs4 import BeautifulSoup


url = "https://docs.google.com/document/d/e/2PACX-1vSWVk1yd_I_zhVROYN2wv1r1y_54M-QL0199ZQ4g9mQZ7QdzekVzsRFUB_JVfkInwLxDNPmrwlY2x7y/pub?fbclid=IwAR0BsTNrbDeLb6j7tU2XhVxeh9WaQU_vELyDS3oNvem3eapiJ1zoBqZIYes"

html_doc = requests.get(url).text

# find the CSS class name whose rule starts with the red color:
match = re.search(r"\.(c\d+)\{color:#cc0000;", html_doc)
if match is None:
    raise SystemExit("no CSS class with color #cc0000 found")
class_name = match.group(1)

soup = BeautifulSoup(html_doc, "html.parser")

# first element carrying that class
print(soup.find(class_=class_name).text)

Prints:

ŠIUO METU - TIK SENJORAI:
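
The same idea can be tried offline. The snippet below is only a sketch: SAMPLE_HTML is a made-up stand-in for the published document (the real page's markup and class names will differ), and find_text_by_color is a hypothetical helper name, not part of bs4. It assumes, like the regex above, that the color is the first property in the class's CSS rule.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical stand-in for the published document: generated class names,
# with colors defined in an embedded <style> block.
SAMPLE_HTML = """
<html><head><style type="text/css">
.c2{color:#000000;font-size:11pt}.c7{color:#cc0000;font-weight:700}
</style></head><body>
<p class="c6"><span class="c7">ŠIUO METU - TIK SENJORAI:</span></p>
<p class="c6"><span class="c2">Gyventojų grupė 55-64 m.</span></p>
</body></html>
"""


def find_text_by_color(html, color="#cc0000"):
    """Return the text of the first element whose class sets `color`, or None."""
    # Look up the generated class name in the stylesheet text.
    match = re.search(r"\.(c\d+)\{color:" + re.escape(color), html)
    if match is None:
        return None
    soup = BeautifulSoup(html, "html.parser")
    element = soup.find(class_=match.group(1))
    return element.get_text(strip=True) if element else None


print(find_text_by_color(SAMPLE_HTML))  # ŠIUO METU - TIK SENJORAI:
```

Because the class name is looked up on every run, it does not matter whether the red text is c7, c1, or something else on a given load. If the styling itself changes (a different color, or the color no longer first in the rule), the helper returns None instead of raising.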
Andrej Kesely