Exclude unwanted tag on Beautifulsoup Python

Question

<span>
  I Like
  <span class='unwanted'> to punch </span>
   your face
 </span>

How can I print "I Like your face" instead of "I Like to punch your face"?

I tried this:

lala = soup.find_all('span')
for p in lala:
    if not p.find(class_='unwanted'):
        print(p.text)

but it gives "TypeError: find() takes no keyword arguments".

You can try `extract()` to remove the tag from the HTML before you get the text. — furas, Nov 23 '16 at 09:37
one of the most human friendly questions on stackoverflow :) — Leonard, Jul 17 '18 at 15:53

furas · Accepted Answer · 2016-11-23T09:50:34.047

You can use `extract()` to remove the unwanted tag from the tree before you get the text.

However, the remaining text keeps all of its '\n' characters and spaces, so you will need some extra work to remove them.

data = '''<span>
  I Like
  <span class='unwanted'> to punch </span>
   your face
 </span>'''

from bs4 import BeautifulSoup as BS

soup = BS(data, 'html.parser')

# find() returns the first match, which is the outer span
external_span = soup.find('span')

print("1 HTML:", external_span)
print("1 TEXT:", external_span.text.strip())

# searching inside the outer span returns the nested (unwanted) span
unwanted = external_span.find('span')
# extract() removes the tag and its content from the tree
unwanted.extract()

print("2 HTML:", external_span)
print("2 TEXT:", external_span.text.strip())

Result

1 HTML: <span>
  I Like
  <span class="unwanted"> to punch </span>
   your face
 </span>
1 TEXT: I Like
   to punch 
   your face
2 HTML: <span>
  I Like

   your face
 </span>
2 TEXT: I Like

   your face
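
The leftover newlines and spaces mentioned above can be collapsed with plain string methods. This is only a minimal sketch, assuming the sample markup closes the outer span with `</span>`; the variable names `html`, `outer` and `clean` are illustrative and not part of the original answer.

```python
from bs4 import BeautifulSoup

html = """<span>
  I Like
  <span class='unwanted'> to punch </span>
   your face
 </span>"""

soup = BeautifulSoup(html, "html.parser")
outer = soup.find("span")

# Remove the nested span from the tree before reading the text.
outer.find("span", class_="unwanted").extract()

# str.split() with no arguments splits on any run of whitespace,
# so re-joining with one space collapses newlines and indentation.
clean = " ".join(outer.get_text().split())
print(clean)  # I Like your face
```

Note that this also collapses whitespace that may be meaningful (for example inside `<pre>`), so it fits short inline text like this example best.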

Alternatively, you can skip every Tag object directly inside the external span and keep only the NavigableString objects (the plain text between the tags).

data = '''<span>
  I Like
  <span class='unwanted'> to punch </span>
   your face
 </span>'''

from bs4 import BeautifulSoup as BS
import bs4

soup = BS(data, 'html.parser')

external_span = soup.find('span')

text = []
for x in external_span:
    if isinstance(x, bs4.element.NavigableString):
        text.append(x.strip())
print(" ".join(text))

Result

I Like your face
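
One limitation of keeping only the direct NavigableString children is that text inside wanted nested tags (say a `<b>`) is dropped as well. A possible variation, sketched here under the assumption that unwanted elements are marked by a class, walks every string under the outer tag and skips those that have an unwanted ancestor. The helper name `text_outside` and the extra `<b>` in the sample are made up for illustration. Unlike `extract()`, this leaves the tree unmodified.

```python
from bs4 import BeautifulSoup

html = """<span>
  I <b>really</b> like
  <span class='unwanted'> to punch </span>
   your face
 </span>"""

soup = BeautifulSoup(html, "html.parser")
outer = soup.find("span")


def text_outside(tag, unwanted_class):
    """Join the text under `tag`, skipping strings inside elements with `unwanted_class`."""
    pieces = []
    for string in tag.find_all(string=True):
        # find_parent() walks up the tree; a match means this string
        # sits somewhere inside an unwanted element.
        if string.find_parent(class_=unwanted_class) is not None:
            continue
        # split() drops surrounding and repeated whitespace.
        pieces.extend(string.split())
    return " ".join(pieces)


print(text_outside(outer, "unwanted"))  # I really like your face
```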

`extract()` works, but only if you have a single unwanted tag. What if I have two tags with the unwanted class? — masbro, Nov 23 '16 at 11:10
`extract()` removes only one element, but if you find more elements you can call it on each of them, for example in a for loop. — furas, Nov 23 '16 at 11:13
Is there a way to do this which doesn't assume the file is small enough to completely read into memory unless the tag we want to exclude is actually excluded? Like one that selectively excludes chars between certain tags? or maybe that reads in chunks? — Rob Truxal, Nov 06 '18 at 02:49
Old question but I'm sure someone will end up here in the future (like me). If you need to extract more than one, use find_all and a for loop: `for junk in external_span.find_all('span'): junk.extract()` — cowsay, May 05 '22 at 04:40
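
The loop suggested in the comments above could look roughly like this. It is a sketch only: the second unwanted span in the sample markup is invented for the demonstration, and matching on `class_="unwanted"` rather than on every nested span is an assumption about what should be removed.

```python
from bs4 import BeautifulSoup

html = """<span>
  I Like
  <span class='unwanted'> to punch </span>
   your
  <span class='unwanted'> ugly </span>
   face
 </span>"""

soup = BeautifulSoup(html, "html.parser")
outer = soup.find("span")

# find_all() returns a list of matches collected up front,
# so removing each tag from the tree inside the loop is safe.
for junk in outer.find_all(class_="unwanted"):
    junk.extract()

result = " ".join(outer.get_text().split())
print(result)  # I Like your face
```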

score 2 · Answer 2 · answered Nov 23 '16 at 09:50

You can easily find the (un)desired text like this:

from bs4 import BeautifulSoup

text = """<span>
  I Like
  <span class='unwanted'> to punch </span>
   your face
 </span>"""
soup = BeautifulSoup(text, "lxml")
for i in soup.find_all("span"):
    if 'class' in i.attrs:
        if "unwanted" in i.attrs['class']:
            print(i.text)

From here, outputting everything else is straightforward.
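
One possible way to finish this idea, not necessarily what the answer's author had in mind, is to reuse the same class check as a filter while iterating over the outer span's children. The sketch assumes the corrected sample markup and uses `html.parser` instead of `lxml` only to avoid an extra dependency; `outer`, `kept` and `result` are illustrative names.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

text = """<span>
  I Like
  <span class='unwanted'> to punch </span>
   your face
 </span>"""

soup = BeautifulSoup(text, "html.parser")
outer = soup.find("span")

kept = []
for child in outer.children:
    if isinstance(child, Tag):
        # Tag.get() returns the list of classes, or the default if there is none.
        if "unwanted" in child.get("class", []):
            continue  # skip the unwanted element entirely
        kept.append(child.get_text())
    else:
        # Plain text node between tags.
        kept.append(str(child))

# Collapse the leftover newlines and indentation.
result = " ".join(" ".join(kept).split())
print(result)  # I Like your face
```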
