26

I have HTML that looks like this:

<div class="systemRequirementsMainBox">
<div class="systemRequirementsRamContent">
<span title="000 Plus Minimum RAM Requirement">1 GB</span> </div>

I'm trying to get 1 GB from there. I tried:

tt  = [a['title'] for a in soup.select(".systemRequirementsRamContent span")]
for ram in tt:
    if "RAM" in ram.split():
        print (soup.string)

It outputs None.

I also tried a['text'], but it raises a KeyError. How can I fix this, and what is my mistake?

GLHF

4 Answers

23

You can use a CSS selector to pull the span you want by its title text:

from bs4 import BeautifulSoup

# "html.parser" is used here; the "xml" parser is meant for XML documents
soup = BeautifulSoup("""<div class="systemRequirementsMainBox">
<div class="systemRequirementsRamContent">
<span title="000 Plus Minimum RAM Requirement">1 GB</span> </div>""", "html.parser")

print(soup.select_one("span[title*=RAM]").text)

That finds the span whose title attribute contains RAM. It is equivalent to the Python check "RAM" in span["title"].
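To make that equivalence concrete, here is a minimal sketch that runs both forms against the snippet from the question. The variable names `via_css` and `via_python` are only illustrative.

```python
from bs4 import BeautifulSoup

html = """<div class="systemRequirementsMainBox">
<div class="systemRequirementsRamContent">
<span title="000 Plus Minimum RAM Requirement">1 GB</span> </div>"""

soup = BeautifulSoup(html, "html.parser")

# CSS attribute selector: title contains the substring "RAM"
via_css = [span.text for span in soup.select("span[title*=RAM]")]

# The same filter spelled out in plain Python
via_python = [
    span.text
    for span in soup.find_all("span", title=True)  # only spans that have a title
    if "RAM" in span["title"]
]

print(via_css, via_python)
```

Both lists should hold the same single entry for this snippet.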

Or using find with re.compile:

import re
print(soup.find("span", title=re.compile("RAM")).text)
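As a side note on the original attempt, the sketch below (run against the question's snippet, with html.parser assumed) shows where the None and the KeyError most likely come from: subscripting a tag looks up HTML attributes, and .string only returns something when a node wraps exactly one string.

```python
from bs4 import BeautifulSoup

html = """<div class="systemRequirementsMainBox">
<div class="systemRequirementsRamContent">
<span title="000 Plus Minimum RAM Requirement">1 GB</span> </div>"""

soup = BeautifulSoup(html, "html.parser")
span = soup.select_one(".systemRequirementsRamContent span")

# Subscripting a tag reads its HTML attributes; "text" is not one of them.
try:
    span["text"]
except KeyError:
    print("no 'text' attribute on the span")

# .string needs a node that holds a single string; the document here holds
# nested tags plus whitespace, so the result is None.
print(soup.string)

# The visible text is available on the tag itself.
print(span["title"], "->", span.text)
```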

To get all the data:

import requests
from bs4 import BeautifulSoup

r = requests.get("http://www.game-debate.com/games/index.php?g_id=21580&game=000%20Plus").content

soup = BeautifulSoup(r, "lxml")
# The RAM requirement sits in its own container
cont = soup.select_one("div.systemRequirementsRamContent")
ram = cont.select_one("span")
print(ram["title"], ram.text)
# The remaining requirements share the smaller-box classes
for span in soup.select("div.systemRequirementsSmallerBox.sysReqGameSmallBox span"):
    print(span["title"], span.text)

Which will give you:

000 Plus Minimum RAM Requirement 1 GB
000 Plus Minimum Operating System Requirement Win Xp 32
000 Plus Minimum Direct X Requirement DX 9
000 Plus Minimum Hard Disk Drive Space Requirement 500 MB
000 Plus GD Adjusted Operating System Requirement Win Xp 32
000 Plus GD Adjusted Direct X Requirement DX 9
000 Plus GD Adjusted Hard Disk Drive Space Requirement 500 MB
000 Plus Recommended Operating System Requirement Win Xp 32
000 Plus Recommended Hard Disk Drive Space Requirement 500 MB
Padraic Cunningham
  • Btw I tried `soup.select_one("span[title*=Space]").text` on this to get `500 MB`, but it prints `HDD Space` and I don't understand why. – GLHF Jun 30 '16 at 22:10
  • can you add a link to the html/url? I think there may be another spans title that contains the text Space – Padraic Cunningham Jun 30 '16 at 22:11
  • Oh well that's because there are span tags with Disk and their texts are `HDD Space`. How can I point that I want only span tags in the game's link? – GLHF Jun 30 '16 at 22:13
  • I mean I have to point that this class `systemRequirementsSmallerBox sysReqGameSmallBox` which is holding the text I want. – GLHF Jun 30 '16 at 22:15
  • `soup.select_one("div.systemRequirementsSmallerBox.sysReqGameSmallBox span[title*=Space]")`, but if you need several things from that container, first find it with `soup.select_one("div.systemRequirementsSmallerBox.sysReqGameSmallBox")` and then call select_one on the result. – Padraic Cunningham Jun 30 '16 at 22:17
  • This is the link btw view-source:http://www.game-debate.com/games/index.php?g_id=21580&game=000%20Plus if you ctrl+f and search for `Disk` you'll see – GLHF Jun 30 '16 at 22:17
  • If you search for `Disk` 5th one I'm talking about, and many thanks for the help. – GLHF Jun 30 '16 at 22:19
  • No worries, do you specifically want the one or all the `div.systemRequirementsSmallerBox.sysReqGameSmallBox` spans? – Padraic Cunningham Jun 30 '16 at 22:21
  • I saw you edited your answer and that's really looks elegant. I should practice more CSS probably, thanks for the help. – GLHF Jun 30 '16 at 22:27
  • I used `soup = BeautifulSoup(r,"html.parser")` because I couldn't install lxml, but it gives me this error when I try to `print (span.text)`: `AttributeError: 'NavigableString' object has no attribute 'text'` – GLHF Jun 30 '16 at 22:40
  • I ran the code with html.parser, are you running the code as posted? – Padraic Cunningham Jun 30 '16 at 22:44
  • Well I figured out, I used `select` instead of `select_one` and it's ok now. – GLHF Jun 30 '16 at 22:46
  • Cool, no worries. As far as CSS goes, https://developer.mozilla.org/en/docs/Web/Guide/CSS/Getting_started/Selectors shows you a lot of what you can use in bs4. The only pseudo-class implemented is nth-of-type, but *= ^= > + ~ etc. are available to use. – Padraic Cunningham Jun 30 '16 at 22:50
  • Using "html.parser" instead of "xml" did the job. – 404rorre Apr 09 '23 at 09:21
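The scoping issue discussed in the comments above can be sketched as follows. The markup is simplified and made up for illustration: only the two class names on the second container and the texts `HDD Space` and `500 MB` come from the thread, while `otherBox` and the first title are hypothetical.

```python
from bs4 import BeautifulSoup

# Simplified, made-up markup; the real page structure may differ.
html = """
<div class="otherBox">
  <span title="Hard Disk Drive Space">HDD Space</span>
</div>
<div class="systemRequirementsSmallerBox sysReqGameSmallBox">
  <span title="000 Plus Minimum Hard Disk Drive Space Requirement">500 MB</span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Unscoped: the first span in document order whose title contains "Space"
unscoped = soup.select_one("span[title*=Space]").text

# Scoped: find the container first, then search only inside it
box = soup.select_one("div.systemRequirementsSmallerBox.sysReqGameSmallBox")
scoped = box.select_one("span[title*=Space]").text

# select() returns a list of tags; select_one() returns one tag or None
all_in_box = [s.text for s in box.select("span")]

print(unscoped, "|", scoped, "|", all_in_box)
```

Selecting the container once and querying inside it keeps the later selectors short when several values are needed from the same box.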
6

You can extract the text of all the span tags in an HTML document with the find_all() function from bs4 (BeautifulSoup):

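from bs4 import BeautifulSoup
import requests

url = "YOUR_URL_HERE"
response = requests.get(url)
# The parser name must be a string; "html5lib" has to be installed separately
soup = BeautifulSoup(response.content, "html5lib")
# A string as the second argument is matched against the CSS class
spans = soup.find_all('span', "ENTER_Css_CLASS_HERE")
for span in spans:
    print(span.text)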
סטנלי גרונן
Swapnil Chopra
1

You can simply search for the span tag in BeautifulSoup, or narrow the search with other attributes such as class or title.

# Ported to bs4 and Python 3; the original used BeautifulSoup 3 and Python 2
from bs4 import BeautifulSoup as BSHTML

htmlText = """<div class="systemRequirementsMainBox">
<div class="systemRequirementsRamContent">
<span title="000 Plus Minimum RAM Requirement">1 GB</span> </div>"""

soup = BSHTML(htmlText, "html.parser")
spans = soup.find_all('span')
# spans = soup.find_all('span', attrs={'class': 'your-class-name'})  # or span by class name
# spans = soup.find_all('span', attrs={'title': '000 Plus Minimum RAM Requirement'})  # or span with a title
for span in spans:
    print(span.text)
Abu Shoeb
  • Why do you need a for loop over the BeautifulSoup ResultSet even when there is only one value? – Gautam Shahi Mar 31 '20 at 22:02
  • @GautamShahi you don't need it for the given example. However, I keep it in general in case you have other span values that you need. – Abu Shoeb Apr 01 '20 at 02:12
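For the single-value case raised in the comments, a loop-free variant might look like the sketch below. It assumes bs4 with html.parser and uses find(), which returns the first match or None; the name `ram` is only illustrative.

```python
from bs4 import BeautifulSoup

html = """<div class="systemRequirementsMainBox">
<div class="systemRequirementsRamContent">
<span title="000 Plus Minimum RAM Requirement">1 GB</span> </div>"""

soup = BeautifulSoup(html, "html.parser")

# find() returns the first matching tag, or None when nothing matches
span = soup.find("span", attrs={"title": "000 Plus Minimum RAM Requirement"})

# Guard against a missing tag before reading its text
ram = span.text if span is not None else None
print(ram)
```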
1

You could solve this with just a couple lines of gazpacho:

from gazpacho import Soup

html = """\
<div class="systemRequirementsMainBox">
<div class="systemRequirementsRamContent">
<span title="000 Plus Minimum RAM Requirement">1 GB</span> </div>
"""

soup = Soup(html)
soup.find("span", {"title": "Minimum RAM Requirement"}).text
# '1 GB'
emehex