1

I've run into a problem. It may be very easy, but I couldn't find it in the documentation.

Here is the target HTML structure, which is very simple:

<h3>Top 
    <em>Mid</em>
    <span>Down</span>
</h3> 

I want to get only the "Top" text that sits directly inside the h3 tag, so I wrote this:

from bs4 import BeautifulSoup
html = "<h3>Top <em>Mid </em><span>Down</span></h3>"
soup = BeautifulSoup(html, "html.parser")  # name the parser explicitly
print(soup.select("h3")[0].text)

But it returns "Top Mid Down". How do I modify it so that it returns only "Top"?

rj487

3 Answers

1

You can use find with text=True and recursive=False, which returns the first text node that is a direct child of the tag:

In [2]: from bs4 import BeautifulSoup
   ...: html ="<h3>Top <em>Mid </em><span>Down</span></h3>"
   ...: soup = BeautifulSoup(html,"html.parser")
   ...: print(soup.find("h3").find(text=True,recursive=False))
   ...: 
Top 

Depending on how the markup is structured, there are several other ways to reach the first node:

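h3 = soup.find("h3")
print(h3.contents[0])     # first entry in the list of direct children
print(next(h3.children))  # first item from the iterator over direct children
print(h3.next_element)    # node parsed right after the <h3> start tag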
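
If the tag's own text might be split around its child tags, a minimal sketch along the following lines may help. It collects every text node that is a direct child and joins them. The helper name `own_text` is made up for illustration, and the `string=` keyword assumes Beautiful Soup 4.4.0 or later (earlier versions call it `text=`).

```python
from bs4 import BeautifulSoup


def own_text(tag):
    """Return the text directly inside `tag`, skipping text in child tags."""
    # recursive=False limits the search to direct children;
    # string=True matches text nodes rather than tags.
    pieces = tag.find_all(string=True, recursive=False)
    return "".join(pieces).strip()


# The multi-line markup from the question, including its indentation.
html = """<h3>Top
    <em>Mid</em>
    <span>Down</span>
</h3>"""
soup = BeautifulSoup(html, "html.parser")
h3 = soup.find("h3")

print(own_text(h3))                  # direct text only
print(h3.get_text(" ", strip=True))  # all descendant text, for comparison
```

Joining and stripping also discards the whitespace-only text nodes that indentation leaves between the child tags.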
Padraic Cunningham
0

Try something like this:

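from bs4 import BeautifulSoup
html = "<h3>Top <em>Mid </em><span>Down</span></h3>"
soup = BeautifulSoup(html, "html.parser")
# select() returns a list, so index it before navigating further.
# findChildren() yields only tags (here <em>), so use .contents for the text node.
print(soup.select("h3")[0].contents[0])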

I am not entirely sure, though. See also: How to find children of nodes using Beautiful Soup

Basically, you need to grab the first child node.

kawadhiya21
-1

It is easy to search with a regex, something like this:

 import re
 pageid = re.search(r'<h3>(.*?)</h3>', curPage, re.DOTALL)

Then get the data inside the tag with the pageid.group(value) method, where value is the number of the capture group (for example, pageid.group(1)).
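
As a rough sketch of what that pattern captures on the question's markup: the first search is the pattern from this answer, and the second is a hypothetical variant (not part of the original answer) that stops at the first "<" so only the leading text is kept. Regexes on HTML tend to be brittle, so treat this as an illustration rather than a robust parser.

```python
import re

html = "<h3>Top <em>Mid </em><span>Down</span></h3>"

# Pattern from the answer: captures everything between the h3 tags,
# including the markup of the child tags.
inner = re.search(r"<h3>(.*?)</h3>", html, re.DOTALL)

# Hypothetical variant: match up to the first "<" to keep only the
# text that precedes the first child tag.
leading = re.search(r"<h3>([^<]*)", html)

if inner and leading:
    print(inner.group(1))
    print(leading.group(1).strip())
```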

Midhun Mohan
Thanks, but I thought there would be an easier way to get the content in BeautifulSoup. – rj487 Jul 25 '16 at 11:36