I am trying to scrape elements from a website.

<h2 class="a b" data-test-search-result-header-title> Heading </h2>

How can I extract the value Heading from the website using BeautifulSoup?

I have tried the following:

Code 1:

soup.find_all(h2,{'class':['a','b']})

Code 2:

soup.find_all(h2,class_='a b'})

Both snippets return an empty list. How can I resolve this?

  • Multiple class names will never get you the element. Do you have some other way to look it up besides the class name? – Abhishek Rai Dec 26 '20 at 21:20
  • In [a relevant thread](https://www.iditect.com/how-to/55015704.html), someone suggests using the CSS selector, as in `soup.select("h2.a.b")`. – niamulbengali Dec 26 '20 at 21:21
  • `Code 1` gives me this element. But it also gives elements with `class="a"` and `class="b"` – furas Dec 26 '20 at 21:33
  • Does this answer your question? [BeautifulSoup findAll() given multiple classes?](https://stackoverflow.com/questions/18725760/beautifulsoup-findall-given-multiple-classes) – Prayson W. Daniel Dec 26 '20 at 21:35
  • @Prayson W. Daniel In that question the problem is to belong to any of the classes. In this case the element should belong to both the classes – Pravallika Myneni Dec 26 '20 at 22:06
  • Yes, but the thread discussion answers both questions. Let me know if none work and I will attempt to solve your issue. – Prayson W. Daniel Dec 27 '20 at 06:41
  • @PravallikaMyneni : Updated my answer and added further information - Could you provide more code / a minimal functional example, please. – HedgeHog Dec 27 '20 at 10:43

1 Answer

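Fix Code 2 to `soup.find_all('h2', class_='a b')`: the tag name has to be a quoted string, and the stray `}` has to go.

Example:

Given four h2 tags with different classes, `soup.find_all('h2', class_='a b')` returns only the first one, because a `class_` string that contains a space is compared with the exact value of the class attribute. `class="b a"` therefore does not match.

To get the text of each h2 element, use `.text`. Because `find_all()` returns a list of results, loop over it:

[heading.text for heading in soup.find_all('h2',class_='a b')]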

from bs4 import BeautifulSoup

html = """
<h2 class="a b"> Heading a and b </h2>
<h2 class="b a"> Heading b and a </h2>
<h2 class="a"> Heading a </h2>
<h2 class="b"> Heading b </h2>
"""

soup = BeautifulSoup(html, 'html.parser')

# class_='a b' matches only the exact attribute value "a b"
print([heading.text for heading in soup.find_all('h2', class_='a b')])

Output

[' Heading a and b ']
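
If the order of the classes in the attribute can vary, a CSS selector may be a more forgiving filter, as suggested in the comments. The following is a minimal sketch, assuming the h2 from the question together with the other sample tags; the variable names are only illustrative. It also shows selecting by the `data-test-search-result-header-title` attribute, which could be an alternative if the class names turn out to be unreliable.

```python
from bs4 import BeautifulSoup

html = """
<h2 class="a b" data-test-search-result-header-title> Heading </h2>
<h2 class="b a"> Heading b and a </h2>
<h2 class="a"> Heading a </h2>
<h2 class="b"> Heading b </h2>
"""

soup = BeautifulSoup(html, "html.parser")

# CSS selector: the h2 must carry both classes, in any order.
both_classes = [h.get_text(strip=True) for h in soup.select("h2.a.b")]

# A list passed to class_ matches tags that have ANY of the listed classes.
any_class = [h.get_text(strip=True) for h in soup.find_all("h2", class_=["a", "b"])]

# Attribute-presence selector, using the data attribute from the question.
by_attribute = [
    h.get_text(strip=True)
    for h in soup.select("h2[data-test-search-result-header-title]")
]

print(both_classes)   # ['Heading', 'Heading b and a']
print(len(any_class)) # 4
print(by_attribute)   # ['Heading']
```

`select("h2.a.b")` returns both the `a b` and the `b a` heading, whereas the list form of `class_` returns all four tags, which matches what furas observed for Code 1.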

Further thoughts

You say that this does not work for you. Without more code or information it is hard to help, and I can only guess. Here is what could also be a reason:

Let's say you are scraping Google results. There are many ways to do that; I will show just two approaches, requests and Selenium.

Requests Example

The classes of the h3 elements, as inspected in the browser, are `LC20lb DKV0Md`.

import requests
from bs4 import BeautifulSoup

r = requests.get('https://www.google.com/search?q=stackoverflow')
soup = BeautifulSoup(r.content, 'lxml')
headingsH3Class = soup.find_all('h3', class_='LC20lb DKV0Md')
headingsH3Only = soup.find_all('h3')

print(headingsH3Class[:2])
print(headingsH3Only[:2],'\n')

Requests Example Output

  • An empty list:

    []

  • A list showing that the inspected classes are not in the page content we get back from requests:

    [<h3 class="zBAuLc"><div class="BNeawe vvjwJb AP7Wnd">Stack Overflow</div></h3>, <h3 class="zBAuLc"><div class="BNeawe vvjwJb AP7Wnd">Stack Overflow (Website) – Wikipedia</div></h3>]

Selenium Example

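from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from bs4 import BeautifulSoup

url = 'https://www.google.com/search?q=stackoverflow'

# Selenium 4 takes the driver path through a Service object;
# the executable_path argument was removed in later 4.x releases.
browser = webdriver.Chrome(service=Service(r'C:\Program Files\ChromeDriver\chromedriver.exe'))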
browser.get(url)

soup = BeautifulSoup(browser.page_source, 'lxml')
headingsH3Class = soup.find_all('h3', class_='LC20lb DKV0Md')
headingsH3Only = soup.find_all('h3')

print(headingsH3Class[:2])
print(headingsH3Only[:2])
browser.close()

Selenium Example Output

  • A list with exactly the h3 elements that have both classes we searched for:

    [<h3 class="LC20lb DKV0Md"><span>Stack Overflow - Where Developers Learn, Share, &amp; Build ...</span></h3>, <h3 class="LC20lb DKV0Md"><span>Stack Overflow (Website) – Wikipedia</span></h3>]

  • A list with all h3 elements:

    [<h3 class="LC20lb DKV0Md"><span>Stack Overflow - Where Developers Learn, Share, &amp; Build ...</span></h3>, <h3 class="r"><a class="l" data-ved="2ahUKEwj426uv9u3tAhUPohQKHYymBMAQjBAwAXoECAcQAQ" href="https://stackoverflow.com/questions" ping="/url?sa=t&amp;source=web&amp;rct=j&amp;url=https://stackoverflow.com/questions&amp;amp;ved=2ahUKEwj426uv9u3tAhUPohQKHYymBMAQjBAwAXoECAcQAQ">Questions</a></h3>]

Conclusion

Always check the data you are actually scraping, because the response you receive and the markup you inspect in the browser can differ.
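
One way to run that check is to list the class values that really occur on the tag you are after, before writing any filter. The sketch below is only an illustration: `classes_used` is a hypothetical helper, and the two strings are shortened stand-ins for `r.text` and `browser.page_source` from the examples above, not live responses.

```python
from bs4 import BeautifulSoup


def classes_used(markup, tag_name):
    """Return the set of class-attribute strings found on tag_name tags."""
    soup = BeautifulSoup(markup, "html.parser")
    return {" ".join(tag.get("class", [])) for tag in soup.find_all(tag_name)}


# Hypothetical stand-ins for the requests response and the browser page source.
response_html = '<h3 class="zBAuLc"><div class="BNeawe vvjwJb AP7Wnd">Stack Overflow</div></h3>'
browser_html = '<h3 class="LC20lb DKV0Md"><span>Stack Overflow</span></h3>'

print(classes_used(response_html, "h3"))  # {'zBAuLc'}
print(classes_used(browser_html, "h3"))   # {'LC20lb DKV0Md'}
```

If the class you inspected in the browser is missing from the first set, the filter is not the problem; the markup you received is simply different.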

HedgeHog