BeautifulSoup4 - Getting incorrect text output with `getText()`

Question

I'm trying to pull some text from a website called Elite Prospects (https://www.eliteprospects.com/team/41/jokerit). Here is the source code from the page:

<div class="semi-logo">
    Jokerit
            <small>
            <span>
                <i> <img class="nation-flag" src="//files.eliteprospects.com/layout/flagsmedium/9.png"> </i>
                <a href="https://www.eliteprospects.com/league/khl">KHL</a>
            </span>
        </small>
    </div>

I'm specifically trying to pull the team name (in this example, "Jokerit") and the league name located in the <a> tag. I can successfully pull the league name, but the way I am trying to pull the team name gives me "JokeritKHL". I'm using this code for multiple pages, so it also needs to handle two-word team names (for example, "Guelph Storm").

Here is my code:

team_logo= scraper.find(class_='semi-logo')
team_name = team_logo.getText(strip=True)
league = team_logo.find('a')
league = league.getText()
print(league)
print(team_name)

And here is the current output I'm getting:

KHL
JokeritKHL

Any ideas?

What I'm trying to find out is whether there is a way to get only the "Jokerit" part.

Original answer - https://stackoverflow.com/questions/1080411/retrieve-links-from-web-page-using-python-and-beautifulsoup — DevesH, Sep 05 '19 at 17:14

Bill M. · Answer 1 · 2019-09-05T19:01:44.193

You could use .find() for this, as follows:

from bs4 import BeautifulSoup

my_html = """
<div class="semi-logo">
    Jokerit
            <small>
            <span>
                <i> <img class="nation-flag" src="//files.eliteprospects.com/layout/flagsmedium/9.png"> </i>
                <a href="https://www.eliteprospects.com/league/khl">KHL</a>
            </span>
        </small>
    </div>  
"""

soup = BeautifulSoup(my_html, 'lxml')

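# text=True matches text nodes, and find() returns only the first match,
# which here is the string sitting directly inside the <div>.
extracted_text = soup.div.find(text=True)
print(extracted_text.strip())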

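If you look at soup.div.children, you'll see that the <div> has three direct children: the text before the <small> tag, the <small> tag (and its contents), and finally some more text, since in this case there is trailing whitespace (a newline) after it. Passing text=True makes .find() match text nodes only, and because .find() returns just the first match, you get the string containing "Jokerit". The .strip() call gets rid of the surrounding whitespace. In Beautiful Soup 4.4.0 and later the same argument is also available as string=True; text= is the older name for it.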
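To make the three children visible, here is a minimal sketch of the same idea. It assumes a shortened copy of the question's markup, uses the built-in "html.parser" backend instead of lxml, and adds recursive=False (not part of the answer above) so that only strings directly inside the <div> are considered.

```python
from bs4 import BeautifulSoup, NavigableString

# Shortened copy of the markup from the question (flag image omitted).
html = """
<div class="semi-logo">
    Jokerit
    <small><span><a href="https://www.eliteprospects.com/league/khl">KHL</a></span></small>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
div = soup.find(class_="semi-logo")

# Label each direct child as either a text node or a tag.
kinds = [
    "text" if isinstance(child, NavigableString) else f"<{child.name}>"
    for child in div.children
]
print(kinds)

# recursive=False limits the search to direct children of the <div>,
# so text nested inside <small> can never be picked up by accident.
first_text = div.find(string=True, recursive=False)
team_name = first_text.strip()
print(team_name)
```

Because the name is taken from a single text node, a two-word name such as "Guelph Storm" should come back intact.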

DevesH · Answer 2 · 2019-09-05T17:43:49.603

team_logo.getText(strip=True) returns all of the text anywhere under the semi-logo element, including the text of nested tags such as the <a>. That is why you get "Jokerit" and "KHL" joined together as "JokeritKHL".
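One way to see this, and to work around it, is to look at the individual strings that getText() glues together. The sketch below is only an illustration: the markup is a made-up variant of the question's snippet using the "Guelph Storm" example, and it relies on the team name being the first non-empty string inside the element.

```python
from bs4 import BeautifulSoup

# Hypothetical markup modeled on the question's snippet, with a two-word name.
html = """
<div class="semi-logo">
    Guelph Storm
    <small><span><a href="#">OHL</a></span></small>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
team_logo = soup.find(class_="semi-logo")

# get_text(strip=True) strips every string and joins them with no separator.
combined = team_logo.get_text(strip=True)
print(combined)

# stripped_strings yields the same pieces one at a time, so they stay separate.
parts = list(team_logo.stripped_strings)
print(parts)

# Assumption: the team name is always the first non-empty string.
team_name = parts[0]
print(team_name)
```

If the page ever puts other text ahead of the team name, taking the first string would no longer be correct, so this is best treated as a starting point.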

edited Sep 05 '19 at 17:43

answered Sep 05 '19 at 17:24

Hi @Dave123 Thank you for the new response. I understand that "team_name = team_logo.getText(strip=True). This returns all text under class semi-logo hierarchy. Therefore you are getting Jokerit + KHL." What I'm trying to find out is there a way to only get the "Jokerit" part – SD_23 Sep 05 '19 at 17:52

score 0 · Answer 3 · answered Sep 05 '19 at 20:58

The names can also be grabbed easily by running a regex over the page source:

import requests, re

urls = ['https://www.eliteprospects.com/team/552/guelph-storm','https://www.eliteprospects.com/team/41/jokerit']
p = re.compile(r"sv2: '(.*)'")
with requests.Session() as s:
    for url in urls:
        r = s.get(url)
        print(p.findall(r.text)[0])
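The snippet above depends on the live pages, so here is an offline sketch of the same regex idea. The sample string is a made-up stand-in for r.text (the real page's script may look different), and extract_team is a hypothetical helper name. It uses [^']* instead of .* so the match stops at the first closing quote, and it returns None rather than raising when the key is missing.

```python
import re

# Made-up stand-in for the page source; the real content is an assumption.
sample_page = "var settings = { sv1: 'team', sv2: 'Guelph Storm', sv3: 'OHL' };"

# [^']* cannot run past the closing quote, unlike a greedy .*
pattern = re.compile(r"sv2: '([^']*)'")


def extract_team(page_text):
    """Return the captured sv2 value, or None if the pattern is absent."""
    match = pattern.search(page_text)
    return match.group(1) if match else None


team = extract_team(sample_page)
missing = extract_team("no such key here")
print(team, missing)
```

Indexing findall(...)[0] directly, as in the answer, raises IndexError when nothing matches, which is worth guarding against when looping over many URLs.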
