<span>
  I Like
  <span class='unwanted'> to punch </span>
   your face
 </span>

How can I print "I Like your face" instead of "I Like to punch your face"?

I tried this:

lala = soup.find_all('span')
for p in lala:
 if not p.find(class_='unwanted'):
    print p.text

but it gives "TypeError: find() takes no keyword arguments".

masbro

2 Answers


You can use extract() to remove the unwanted tag before you get the text.

But it keeps all the '\n' characters and spaces, so you will need some extra work to remove them.

data = '''<span>
  I Like
  <span class='unwanted'> to punch </span>
   your face
 </span>'''

from bs4 import BeautifulSoup as BS

soup = BS(data, 'html.parser')

# the outer <span> is the first one in the document
external_span = soup.find('span')

print("1 HTML:", external_span)
print("1 TEXT:", external_span.text.strip())

# the first <span> nested inside it is the unwanted one; extract() removes it from the tree
unwanted = external_span.find('span')
unwanted.extract()

print("2 HTML:", external_span)
print("2 TEXT:", external_span.text.strip())

Result

1 HTML: <span>
  I Like
  <span class="unwanted"> to punch </span>
   your face
 </span>
1 TEXT: I Like
   to punch 
   your face
2 HTML: <span>
  I Like

   your face
 </span>
2 TEXT: I Like

   your face
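The leftover newlines and spaces can be collapsed after the tag is removed. The following is only a minimal sketch, assuming the same sample HTML as above; the variable name `cleaned` is made up for illustration. It relies on `str.split()` with no arguments, which splits on any run of whitespace.

```python
from bs4 import BeautifulSoup

data = '''<span>
  I Like
  <span class='unwanted'> to punch </span>
   your face
 </span>'''

soup = BeautifulSoup(data, 'html.parser')
external_span = soup.find('span')

# remove the nested (unwanted) span from the tree
external_span.find('span').extract()

# split() without arguments breaks on any whitespace run (spaces, '\n'),
# so joining the pieces with a single space normalizes the text
cleaned = " ".join(external_span.get_text().split())
print(cleaned)  # I Like your face
```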

You can also skip every Tag object inside the external span and keep only the NavigableString objects (the plain text in the HTML).

data = '''<span>
  I Like
  <span class='unwanted'> to punch </span>
   your face
 </span>'''

from bs4 import BeautifulSoup as BS
import bs4

soup = BS(data, 'html.parser')

external_span = soup.find('span')

text = []
for x in external_span:
    if isinstance(x, bs4.element.NavigableString):
        text.append(x.strip())
print(" ".join(text))

Result

I Like your face
furas
  • extract() works, but only if you have just one unwanted tag. What if I have 2 tags with the unwanted class? – masbro Nov 23 '16 at 11:10
  • `extract()` removes only one element, but if you find more elements you can call it on every one of them, for example in a for loop. – furas Nov 23 '16 at 11:13
  • Is there a way to do this which doesn't assume the file is small enough to completely read into memory unless the tag we want to exclude is actually excluded? Like one that selectively excludes chars between certain tags? or maybe that reads in chunks? – Rob Truxal Nov 06 '18 at 02:49
  • Old question but I'm sure someone will end up here in the future (like me). If you need to extract more than one, use find_all and a for loop: `for junk in external_span.find_all('span'): junk.extract()` – cowsay May 05 '22 at 04:40
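
The loop suggested in the comments above could look roughly like this. It is a sketch only: the input with two unwanted spans is invented for illustration (it is not from the question), and filtering on `class_="unwanted"` is an assumption about which tags should go.

```python
from bs4 import BeautifulSoup

# hypothetical input with two unwanted spans (not from the original question)
html = """<span>
  I Like
  <span class='unwanted'> to punch </span>
   your
  <span class='unwanted'> silly </span>
   face
 </span>"""

soup = BeautifulSoup(html, "html.parser")
outer = soup.find("span")

# find_all() returns a list-like result, so extracting inside the loop is safe
for junk in outer.find_all("span", class_="unwanted"):
    junk.extract()

# collapse the leftover whitespace into single spaces
result = " ".join(outer.get_text().split())
print(result)  # I Like your face
```

If the removed tags are not needed afterwards, `decompose()` can be used instead of `extract()`; it destroys the tag rather than returning it.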

You can easily find the (un)desired text like this:

from bs4 import BeautifulSoup

text = """<span>
  I Like
  <span class='unwanted'> to punch </span>
   your face
 </span>"""
soup = BeautifulSoup(text, "lxml")
for i in soup.find_all("span"):
    if 'class' in i.attrs:
        if "unwanted" in i.attrs['class']:
            print(i.text)

From here, outputting everything else is easy.
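
One possible way to output everything else is to walk over the text nodes and skip those that sit inside an unwanted tag. This is a sketch under assumptions, not necessarily what the answer had in mind: it uses `html.parser` so no extra parser is needed, the names `kept` and `kept_text` are made up, and the `string=True` argument needs Beautiful Soup 4.4.0 or newer (older versions call it `text=True`).

```python
from bs4 import BeautifulSoup

text = """<span>
  I Like
  <span class='unwanted'> to punch </span>
   your face
 </span>"""
soup = BeautifulSoup(text, "html.parser")
outer = soup.find("span")

kept = []
# find_all(string=True) yields every text node under the outer span
for s in outer.find_all(string=True):
    # keep the text only if no ancestor tag has class "unwanted"
    if s.find_parent(class_="unwanted") is None:
        kept.append(s.strip())

# drop empty pieces and join the rest with single spaces
kept_text = " ".join(part for part in kept if part)
print(kept_text)  # I Like your face
```

Unlike `extract()`, this leaves the parsed tree unchanged.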

Gábor Erdős