Hey guys, I'm trying to use BeautifulSoup to get the content of a font tag. In the HTML page I'm parsing, the tag I want to get the text from looks like:

<font color="#000000">Text I want to extract</font>

Going off another Stack Overflow question (how to extract text within font tag using beautifulsoup), I'm trying to use:

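from urllib.request import urlopen

from bs4 import BeautifulSoup

html = urlopen(str(BASE_URL)).read()  # BASE_URL is not shown in the question
soup = BeautifulSoup(html, "lxml")
info = soup('font', color="#000000")  # shorthand for soup.find_all(...)

print(str(info))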

but this only prints []. Any idea what I'm doing wrong?

Martijn Pieters
Pecans
  • What does the HTML look like? Also, it would be something like `print(soup.find("font").text)` to get the text. – Padraic Cunningham Feb 22 '15 at 20:46
  • The code you posted works just fine, which means that the HTML you downloaded either doesn't include such a tag at all or is so broken that the repaired version lxml produced eliminates the tag. – Martijn Pieters Feb 22 '15 at 20:51
  • The HTML seems extremely convoluted to me. Here's a screenshot of inspect element: http://imgur.com/nZ0ZAQ1 – Pecans Feb 22 '15 at 21:04
  • @Pecans: what your browser is served is not necessarily what `urlopen()` loads; servers are free to serve you different content based on the headers and cookies. Your browser can also be using JavaScript code to alter the page and load additional content asynchronously. Using `inspect` will show you the page *after* those changes. – Martijn Pieters Feb 22 '15 at 21:05
  • @Pecans: in other words, just because you see it in inspect in your browser doesn't mean it is *actually there* when loaded with Python. Look at the actual page source (view source) and check the Network tab to see whether any asynchronous requests are made. Check for JavaScript code that may be transforming the content. – Martijn Pieters Feb 22 '15 at 21:07
  • Thanks @Martijn Pieters! Looking at the source, I found out that the text was actually loaded from a different web page for some reason. Scraping that web page instead allowed me to get what I needed! – Pecans Feb 22 '15 at 21:23
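
The check described in the comments above can be sketched roughly as follows: look for the tag in the raw markup the script received, then count matches in the parsed tree. The helper name `diagnose_missing_tag` and both sample strings are hypothetical stand-ins, not the actual page from the question, and the built-in `html.parser` is used here instead of lxml so the sketch has no extra dependency.

```python
from bs4 import BeautifulSoup


def diagnose_missing_tag(html, tag_name, **attrs):
    """Report whether a tag appears in the raw markup and in the parsed tree.

    Hypothetical helper for illustration; the raw check is a crude substring test.
    """
    if isinstance(html, bytes):
        html = html.decode("utf-8", errors="replace")
    # Crude test: is the opening tag anywhere in what was actually downloaded?
    in_raw = ("<" + tag_name.lower()) in html.lower()
    # Count matches after parsing, filtering on the given attributes.
    soup = BeautifulSoup(html, "html.parser")
    matches = soup.find_all(tag_name, attrs=attrs)
    return {"in_raw_source": in_raw, "parsed_matches": len(matches)}


# Hypothetical stand-ins: what a script might download vs. what the
# browser inspector shows after JavaScript has filled in the page.
served_to_script = '<div id="content"><script src="load.js"></script></div>'
seen_in_inspector = (
    '<div id="content"><font color="#000000">Text I want to extract</font></div>'
)

print(diagnose_missing_tag(served_to_script, "font", color="#000000"))
# {'in_raw_source': False, 'parsed_matches': 0}
print(diagnose_missing_tag(seen_in_inspector, "font", color="#000000"))
# {'in_raw_source': True, 'parsed_matches': 1}
```

If the tag is missing from the raw source, the content is probably loaded from elsewhere (as turned out to be the case here); if it is in the raw source but has no parsed matches, the parser's repair of broken markup is the more likely suspect.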

1 Answer

Here you go:

from bs4 import BeautifulSoup

html = """<font color="#000000">Text I want to extract</font>"""

soup = BeautifulSoup(html, 'html.parser')

result1 = soup.find('font').text  # not specifying the color attribute
result2 = soup.find('font', {'color':'#000000'}).text  # specifying the color attribute

print(result1)  # prints 'Text I want to extract'
print(result2)  # prints 'Text I want to extract'
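
As a side note, here is a minimal sketch of how the `soup('font', ...)` call from the question relates to `find()`: calling the soup object directly is shorthand for `find_all()` and returns a list-like result (empty when nothing matches), while `find()` returns `None` when nothing matches, so reading `.text` on it would raise an `AttributeError`. The two-line HTML sample and the extra colors are made up for illustration.

```python
from bs4 import BeautifulSoup

# Made-up sample markup for illustration only.
html = """
<p><font color="#000000">Text I want to extract</font></p>
<p><font color="#ff0000">Other text</font></p>
"""
soup = BeautifulSoup(html, "html.parser")

# soup(...) is shorthand for soup.find_all(...); the result is list-like.
black_fonts = soup("font", color="#000000")
texts = [tag.get_text(strip=True) for tag in black_fonts]
print(texts)  # ['Text I want to extract']

# find() returns None when nothing matches, so guard before reading the text.
missing = soup.find("font", color="#00ff00")
missing_text = missing.get_text() if missing is not None else None
print(missing_text)  # None
```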
vadimhmyrov