I'm trying to pull some text from a website called Elite Prospects (https://www.eliteprospects.com/team/41/jokerit). Here is the source code from the page:

<div class="semi-logo">
    Jokerit
            <small>
            <span>
                <i> <img class="nation-flag" src="//files.eliteprospects.com/layout/flagsmedium/9.png"> </i>
                <a href="https://www.eliteprospects.com/league/khl">KHL</a>
            </span>
        </small>
    </div>            

I'm specifically trying to pull the team name (in this example, "Jokerit") and the league name located in the a href tag. I can successfully pull the league name, but the way I'm trying to pull the team name gives me "JokeritKHL". I'm using this code for multiple pages, so it also needs to handle a two-word team name (for example, "Guelph Storm").

Here is my code:

team_logo= scraper.find(class_='semi-logo')
team_name = team_logo.getText(strip=True)
league = team_logo.find('a')
league = league.getText()
print(league)
print(team_name)

And here is the current output I'm getting:

KHL
JokeritKHL

Any ideas?

What I'm trying to find out is whether there is a way to get only the "Jokerit" part.

DevesH
SD_23
  • Original answer - https://stackoverflow.com/questions/1080411/retrieve-links-from-web-page-using-python-and-beautifulsoup – DevesH Sep 05 '19 at 17:14

3 Answers

You could use .find() for this, as follows:

from bs4 import BeautifulSoup

my_html = """
<div class="semi-logo">
    Jokerit
            <small>
            <span>
                <i> <img class="nation-flag" src="//files.eliteprospects.com/layout/flagsmedium/9.png"> </i>
                <a href="https://www.eliteprospects.com/league/khl">KHL</a>
            </span>
        </small>
    </div>  
"""

soup = BeautifulSoup(my_html, 'lxml')

extracted_text = soup.div.find(text=True)
print(extracted_text.strip())

If you look at soup.div.children, you'll see that the div has three direct children: the text before the <small> tag, the <small> tag itself (with its content), and finally some more text, since in this case there's a \n at the end. find(text=True) returns the first text node it encounters, which here is the team name, and .strip() gets rid of the surrounding whitespace. (In Beautiful Soup 4.4.0 and later, the same argument is also available as string=.)
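To make the "direct children" point concrete, here is a minimal sketch that walks the children of the div and keeps only the text nodes. It assumes a two-word team name ("Guelph Storm") and an "OHL" league link purely for illustration; that markup is a stand-in modelled on the snippet in the question, not copied from the live page, and it uses the built-in html.parser instead of lxml so it has no extra dependency.

```python
from bs4 import BeautifulSoup, NavigableString

# Illustrative markup modelled on the question's snippet (assumed, not the live page).
my_html = """
<div class="semi-logo">
    Guelph Storm
    <small>
        <span>
            <a href="https://www.eliteprospects.com/league/ohl">OHL</a>
        </span>
    </small>
</div>
"""

soup = BeautifulSoup(my_html, "html.parser")
team_logo = soup.find(class_="semi-logo")

# Keep only the text nodes that are direct children of the div,
# skipping the <small> tag and everything nested inside it.
direct_text = [
    child.strip()
    for child in team_logo.children
    if isinstance(child, NavigableString)
]

# Drop the whitespace-only pieces and join what is left.
team_name = " ".join(part for part in direct_text if part)
league = team_logo.find("a").get_text(strip=True)

print(team_name)
print(league)
```

Because only the div's own text nodes are considered, the league name inside the nested tags never gets mixed into the team name, and spaces inside a multi-word name are preserved.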

Bill M.

team_name = team_logo.getText(strip=True) returns all of the text under the semi-logo element, including the text of its nested tags. That is why you are getting Jokerit + KHL.
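A small sketch of that behaviour, assuming a condensed version of the markup from the question: get_text(strip=True) concatenates every stripped string under the element, while stripped_strings yields the same pieces one at a time, so the first one can be taken on its own.

```python
from bs4 import BeautifulSoup

# Condensed version of the question's markup (assumed for illustration).
my_html = (
    '<div class="semi-logo"> Jokerit '
    '<small><span>'
    '<a href="https://www.eliteprospects.com/league/khl">KHL</a>'
    '</span></small></div>'
)

team_logo = BeautifulSoup(my_html, "html.parser").find(class_="semi-logo")

# All descendant text, stripped and joined with no separator.
joined = team_logo.get_text(strip=True)

# The same pieces as separate strings, in document order.
parts = list(team_logo.stripped_strings)
team_name = parts[0]

print(joined)
print(parts)
print(team_name)
```

Taking the first stripped string relies on the team name being the first text inside the div, which holds for the snippet in the question but is an assumption about other pages.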

DevesH
  • Hi @Dave123 Thank you for the new response. I understand that "team_name = team_logo.getText(strip=True). This returns all text under class semi-logo hierarchy. Therefore you are getting Jokerit + KHL." What I'm trying to find out is there a way to only get the "Jokerit" part – SD_23 Sep 05 '19 at 17:52

They can also be grabbed easily by applying a regex to the response text:

import requests, re

urls = ['https://www.eliteprospects.com/team/552/guelph-storm','https://www.eliteprospects.com/team/41/jokerit']
p = re.compile(r"sv2: '(.*)'")
with requests.Session() as s:
    for url in urls:
        r = s.get(url)
        print(p.findall(r.text)[0])
QHarr