I have a string, and I need to remove everything between two marker substrings, including the markers themselves.

At the moment I have the following code, but I'm not sure why it doesn't work.

def removeYoutube(itemDescription):
    itemDescription = re.sub('<iframe>.*</iframe>','',desc,flags=re.DOTALL)
    return itemDescription

It doesn't remove the text between, and including, <iframe> and </iframe>.

Example Input (String):

"<div style="text-align: center;"><iframe allowfullscreen="frameborder=0" height="350" src="https://www.youtube.com/embed/EKaUJExxmEA" width="650"></iframe></div>"

Expected Output: <div style="text-align: center;"></div>

As you can see from the output it should remove all of the parts containing <iframe></iframe>.

baduker
Morgan

1 Answer

Use BeautifulSoup rather than regex. Regex is a poor choice for parsing HTML, because tags can carry arbitrary attributes and can nest, and simple patterns handle both badly.
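To make the problem concrete, here is a small sketch of why the pattern in the question never matches. The literal '<iframe>' only matches an opening tag with no attributes, while the input has several. The question's function also passes desc to re.sub instead of its itemDescription parameter. The shortened html string and the patched pattern below are illustrative assumptions, not part of the original post, and the patched pattern is still fragile on real-world HTML.

```python
import re

# Shortened version of the input from the question (assumption for illustration).
html = '<div><iframe height="350" src="https://www.youtube.com/embed/EKaUJExxmEA"></iframe></div>'

# Pattern from the question: the literal "<iframe>" requires a tag with no attributes.
original = re.sub('<iframe>.*</iframe>', '', html, flags=re.DOTALL)

# Hypothetical patch: [^>]* allows attributes, .*? stops at the first closing tag.
pattern = re.compile(r'<iframe\b[^>]*>.*?</iframe>', flags=re.DOTALL | re.IGNORECASE)
patched = pattern.sub('', html)

print(original == html)  # True: nothing was removed
print(patched)           # <div></div>
```

Even the patched pattern breaks on cases such as a '>' inside an attribute value, which is why a parser is the safer route.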

Here's how to do it with BeautifulSoup:

from bs4 import BeautifulSoup

sample = """
<div style="text-align: center;"><iframe allowfullscreen="frameborder=0" height="350" src="https://www.youtube.com/embed/EKaUJExxmEA" width="650"></iframe></div>
"""

s = BeautifulSoup(sample, "html.parser")

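# Ask for the iframe tags directly instead of scanning every tag.
for tag in s.find_all("iframe"):
    # decompose() removes the tag and its contents from the tree.
    tag.decompose()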
print(s)

Output:

<div style="text-align: center;"></div>
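If you want to keep the shape of the function from the question, the same idea can be wrapped up roughly like this. This is a sketch only: remove_iframes and its parameter name are hypothetical stand-ins for removeYoutube, and it assumes the description is an HTML fragment that html.parser can handle.

```python
from bs4 import BeautifulSoup


def remove_iframes(item_description):
    """Return item_description with every <iframe> element removed."""
    soup = BeautifulSoup(item_description, "html.parser")
    for tag in soup.find_all("iframe"):
        # Drop the tag together with anything nested inside it.
        tag.decompose()
    # Serialize the modified tree back to a string.
    return str(soup)


description = (
    '<div style="text-align: center;">'
    '<iframe allowfullscreen="frameborder=0" height="350" '
    'src="https://www.youtube.com/embed/EKaUJExxmEA" width="650"></iframe>'
    '</div>'
)
print(remove_iframes(description))
```

Returning str(soup) gives you a plain string again, so the result can be used wherever the original description was.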
baduker
    Thanks for the answer. I don't know why I didn't think of that, and thanks for linking the page explaining why. I will be using this more than regex in future. Much appreciated :) – Morgan Feb 14 '21 at 15:01