1

I have a website in the following format:

<html lang="en">
<head>
    #anything
</head>
<body>
    <div id="div1">
        <div id="div2">
            <div class="class1">
                #something
            </div>
            <div class="class2">
                #something
            </div>
            <div class="class3">
                <div class="sub-class1">
                    <div id="statHolder">
                        <div class="Class 1 of 15">
                            "Name"
                            <b>Bob</b>
                        </div>
                        <div class="Class 2 of 15">
                            "Age"
                            <b>24</b>
                        </div>
                        # Here are 15 of these kinds
                    </div>
                </div>
            </div>
        </div>
    </div>
</body>
</html>

I want to retrieve all the content in those 15 classes. How do I do that?

Edit: My Current Approach:

import requests
from bs4 import BeautifulSoup

url = 'my-url-here'
response = requests.get(url)
soup = BeautifulSoup(response.content, "html.parser")
name_box = soup.findAll('div', {"id": "div1"})  # I don't know what to do after this

Expected Output:

Name: Bob
Age: 24
#All 15 entries like this

I am using BeautifulSoup4 for this. Is there any direct way to get all the contents in <div id="stats">?

costaparas
AwesomeSam

2 Answers

2

Based on the HTML above, you can try it this way:

import requests
from bs4 import BeautifulSoup

result = {}
url = 'my-url-here'
response = requests.get(url)
soup = BeautifulSoup(response.content, "html.parser")
stats = soup.find('div', {'id': 'statHolder'})
for data in stats.find_all('div'):
    key, value = data.text.split()
    result[key.replace('"', '')] = value

print(result)
# Prints:
# {'Name': 'Bob', 'Age': '24'}

for key, value in result.items():
    print(f'{key}: {value}')
# Prints: 
# Name: Bob
# Age: 24

This finds the div with the id of statHolder.

Then, we find all divs inside that div and split each one's text on whitespace (using split), which gives two pieces: the first is the key, and the second is the value. We also remove the double quotes from the key using replace.

Then, we add the key-value pair to our result dictionary.

Iterating through this, you can get the desired output as shown.
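
A slightly more defensive variant of the same loop, as a sketch only: it assumes the structure shown in the question (a quoted label followed by a `<b>` value inside each child `div`), and `extract_stats` and `SAMPLE_HTML` are hypothetical names introduced here. It returns an empty dictionary when the `statHolder` element is missing, instead of failing on `None`, and it keeps values that contain spaces intact, which a plain `split()` into two names would not.

```python
from bs4 import BeautifulSoup

# Shortened sample in the shape shown in the question (illustrative data).
SAMPLE_HTML = """
<div id="statHolder">
    <div class="Class 1 of 15">
        "Name"
        <b>Bob Smith</b>
    </div>
    <div class="Class 2 of 15">
        "Age"
        <b>24</b>
    </div>
</div>
"""


def extract_stats(html, holder_id="statHolder"):
    """Return {label: value} for each child div of the holder (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    holder = soup.find("div", id=holder_id)
    if holder is None:
        # The id is not in the fetched HTML (wrong id, or the content
        # is added later by JavaScript).
        return {}

    result = {}
    # recursive=False limits the search to direct children of the holder.
    for row in holder.find_all("div", recursive=False):
        value_tag = row.find("b")
        if value_tag is None:
            continue
        # The first stripped string is the label, e.g. '"Name"'.
        label = next(row.stripped_strings, "").strip('"')
        result[label] = value_tag.get_text(strip=True)
    return result


stats = extract_stats(SAMPLE_HTML)
for key, value in stats.items():
    print(f"{key}: {value}")
```

An empty result from `extract_stats` is a hint that the element is not present in the HTML that `requests` received.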

costaparas
  • Error `for data in stats.findAll('div'): AttributeError: 'NoneType' object has no attribute 'findAll'` – AwesomeSam Jan 10 '21 at 06:43
  • I see in the HTML the parent `div` has the `id` of `stats`, so I used that. If it's something different, then you'll have to adjust it to the right ID – costaparas Jan 10 '21 at 06:46
  • I have changed it, obviously. I renamed `stats` to the original name on the website – AwesomeSam Jan 10 '21 at 06:48
  • Also, I have given the website, so now you can try too.. – AwesomeSam Jan 10 '21 at 06:50
  • That HTML is inside a JavaScript function on the site, so it's a bit tricky to parse it this way. You'll need to use Selenium to get the page and let the JS run, and then you can extract the data. – costaparas Jan 10 '21 at 07:02
  • Ok thanks. I am not familiar with selenium anyways.. will have to learn it first – AwesomeSam Jan 10 '21 at 07:03
  • Ok sure, it's not a lot of extra work; [this](https://stackoverflow.com/questions/13960326/how-can-i-parse-a-website-using-selenium-and-beautifulsoup-in-python) post pretty much shows you how to replace `requests` with Selenium to get the source. – costaparas Jan 10 '21 at 07:11
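
A minimal sketch of the swap described in these comments, under several assumptions: Selenium and a Chrome driver are installed, the rendered page contains an element with `id="statHolder"` as in the question, and nothing (such as a captcha) blocks the page load. `fetch_rendered_html` and `parse_stats` are hypothetical helper names, not part of either library.

```python
from bs4 import BeautifulSoup


def fetch_rendered_html(url, holder_id="statHolder", timeout=10):
    """Load the page in a browser and return the HTML after JavaScript has run."""
    # Imported here so parse_stats below can be used without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Chrome()  # assumes Chrome and a matching driver are available
    try:
        driver.get(url)
        # Wait until the element created by the page's script is in the DOM.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.ID, holder_id))
        )
        return driver.page_source
    finally:
        driver.quit()


def parse_stats(html, holder_id="statHolder"):
    """Parse the rendered HTML exactly as one would parse response.content."""
    soup = BeautifulSoup(html, "html.parser")
    holder = soup.find("div", id=holder_id)
    if holder is None:
        return {}
    return {
        next(row.stripped_strings, "").strip('"'): row.b.get_text(strip=True)
        for row in holder.find_all("div", recursive=False)
        if row.b is not None
    }


# Usage (needs a browser, so it is left commented out):
# stats = parse_stats(fetch_rendered_html("my-url-here"))
```

Only the way the HTML is obtained changes; the BeautifulSoup side stays the same because `driver.page_source` is a string of HTML.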
2

If you work from the actual HTML of the webpage, the following will give you the stats as a dictionary. It takes each element with class pSt as the key and then moves to the strong tag inside it to get the associated value.

from bs4 import BeautifulSoup as bs
# html is response.content, assuming the page is not rendered dynamically
soup = bs(html, 'html.parser')
stats = {i.text:i.strong.text for i in soup.select('.pSt')}

For the HTML shown in the question, you can use stripped_strings to get the first string in each div (the label) as the key:

from bs4 import BeautifulSoup as bs

html = '''
<html lang="en">
<head>
    #anything
</head>
<body>
    <div id="div1">
        <div id="div2">
            <div class="class1">
                #something
            </div>
            <div class="class2">
                #something
            </div>
            <div class="class3">
                <div class="sub-class1">
                    <div id="statHolder">
                        <div class="Class 1 of 15">
                            "Name"
                            <b>Bob</b>
                        </div>
                        <div class="Class 2 of 15">
                            "Age"
                            <b>24</b>
                        </div>
                        # Here are 15 of these kinds
                    </div>
                </div>
            </div>
        </div>
    </div>
</body>
</html>
'''
soup = bs(html, 'html.parser')
stats = {[s for s in i.stripped_strings][0]:i.b.text for i in soup.select('#statHolder [class^=Class]')}
print(stats)
QHarr
  • Is there any way to integrate selenium in this code so it works with dynamic webpage? Or download the html file and work upon it? – AwesomeSam Jan 10 '21 at 07:04
  • You would still need to manually bypass the captcha, but feel free to open a question about automating with Selenium and explain where you have got to. – QHarr Jan 10 '21 at 07:08
  • Does that captcha also block me from downloading the webpage as `.html`? Because after that your code is just perfect! – AwesomeSam Jan 10 '21 at 07:12
  • For Selenium, you could pass driver.page_source to bs4 to parse the HTML. – QHarr Jan 10 '21 at 07:12
  • I would imagine yes, unless they provide an API or some authentication method you would need to integrate into the code. – QHarr Jan 10 '21 at 07:13
  • Ok sure.. I ll look for that too – AwesomeSam Jan 10 '21 at 07:14
  • And I meant complete, rather than bypass, the captcha – QHarr Jan 10 '21 at 07:14
  • Hey, I tried working with Selenium. First, it's quite complicated. And next, will I be able to host it on Heroku and will it still work? Please help me, I just want to get simple stats :( – AwesomeSam Jan 10 '21 at 09:22
  • @QHarr Can you explain this amazing line `stats = {[s for s in i.stripped_strings][0]:i.b.text for i in soup.select('#statHolder [class^=Class]')}`. Can you split it into parts to try to get the whole idea? – YasserKhalil Jan 12 '21 at 11:06
  • @YasserKhalil Please see this: https://pastebin.com/4EGQY7hL – QHarr Jan 12 '21 at 16:45
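
One way to read that one-liner is as an ordinary loop. The sketch below is an unrolled equivalent written against the HTML shown in the question; it is not the content of the linked paste, and the shortened `html` sample and the names `rows`, `strings`, `key` and `value` are introduced here for illustration.

```python
from bs4 import BeautifulSoup as bs

html = '''
<div id="statHolder">
    <div class="Class 1 of 15">
        "Name"
        <b>Bob</b>
    </div>
    <div class="Class 2 of 15">
        "Age"
        <b>24</b>
    </div>
</div>
'''

soup = bs(html, 'html.parser')

# '#statHolder [class^=Class]' selects descendants of the element with
# id="statHolder" whose class attribute starts with "Class".
rows = soup.select('#statHolder [class^=Class]')

stats = {}
for row in rows:
    # stripped_strings yields each piece of text in the row with surrounding
    # whitespace removed: first the quoted label, then the text inside <b>.
    strings = list(row.stripped_strings)
    key = strings[0]    # the label, with its double quotes still attached
    value = row.b.text  # text of the first <b> inside the row
    stats[key] = value

print(stats)
```

Note that the keys keep the literal double quotes from the page text; `key.strip('"')` would remove them if plain labels are wanted.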