207

I have a text like this:

text = """<div>
<h1>Title</h1>
<p>A long text........ </p>
<a href=""> a link </a>
</div>"""

Using pure Python, with no external modules, I want to get this:

>>> print remove_tags(text)
Title A long text..... a link

I know I can do it using lxml.html.fromstring(text).text_content(), but I need to achieve the same in pure Python, using only builtins or the standard library, for 2.6+.

How can I do that?

obmarg
  • 9,369
  • 36
  • 59

5 Answers

425

Using a regex

Using a regex, you can clean everything inside <> :

import re
# as per recommendation from @freylis, compile once only
CLEANR = re.compile('<.*?>') 

def cleanhtml(raw_html):
  cleantext = re.sub(CLEANR, '', raw_html)
  return cleantext

Some HTML text can also contain entities that are not enclosed in angle brackets, such as '&nbsp;'. If that is the case, you might want to write the regex as

CLEANR = re.compile('<.*?>|&([a-z0-9]+|#[0-9]{1,6}|#x[0-9a-f]{1,6});')

This link contains more details on this.
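Deleting entities with the second pattern also drops the characters they stand for (an `&amp;` simply disappears). A minimal sketch of an alternative, assuming Python 3.4+ rather than the 2.6+ from the question: strip the tags with the regex, then decode the entities with the standard library's `html.unescape`. The names `strip_tags`, `TAG_RE` and `WS_RE` are illustrative only, and the regex has the same limitations as above (it is not a real HTML parser).

```python
import html
import re

# Non-greedy match of anything between "<" and the next ">".
# DOTALL lets a tag that spans several lines match as well.
TAG_RE = re.compile(r"<.*?>", re.DOTALL)
WS_RE = re.compile(r"\s+")


def strip_tags(raw_html):
    """Drop tag-like spans, decode entities, and collapse whitespace."""
    # Replace each tag with a space so adjacent words do not run together.
    no_tags = TAG_RE.sub(" ", raw_html)
    # Turn entities such as "&amp;" back into the characters they encode.
    decoded = html.unescape(no_tags)
    return WS_RE.sub(" ", decoded).strip()


sample = """<div>
<h1>Title</h1>
<p>A long text &amp; more</p>
<a href=""> a link </a>
</div>"""

print(strip_tags(sample))  # Title A long text & more a link
```

Decoding after stripping means an escaped `&lt;` ends up as a literal `<` in the output instead of being mistaken for the start of a tag.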

Using BeautifulSoup

You could also use the additional BeautifulSoup package to extract all the raw text.

You will need to explicitly set a parser when calling BeautifulSoup. I recommend "lxml", as mentioned in alternative answers: it is much more robust than the default one (html.parser, which is available without an additional install).

from bs4 import BeautifulSoup
cleantext = BeautifulSoup(raw_html, "lxml").text

But this still relies on an external library, so I recommend the first solution.

EDIT: To use lxml you need to pip install lxml.
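For the "standard library only" constraint in the question, the `html.parser` module mentioned above can also be used directly, without BeautifulSoup. The following is only a sketch under stated assumptions: it targets Python 3 (on Python 2 the module is named `HTMLParser`), the class name `TextExtractor` and the choice to skip `<script>` and `<style>` content are illustrative, and it is not the method of any answer in this thread.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, ignoring the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}  # illustrative choice, adjust as needed

    def __init__(self):
        # convert_charrefs=True makes the parser decode entities like &amp;.
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if not self._skip_depth:
            self._chunks.append(data)

    def text(self):
        # Join the pieces and collapse all runs of whitespace.
        return " ".join("".join(self._chunks).split())


def remove_tags(markup):
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    return parser.text()


text = """<div>
<h1>Title</h1>
<p>A long text........ </p>
<a href=""> a link </a>
</div>"""

print(remove_tags(text))  # Title A long text........ a link
```

Because the parser understands quoted attribute values, input such as `<a title=">">x</a>` yields `x`, which the simple regex does not handle.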

c24b
  • 5,278
  • 6
  • 27
  • 35
  • 14
    If you want to compile the regexp, the best way is to compile it outside the function. In your example, every call to `cleanhtml` must compile the regexp again – freylis Jun 20 '14 at 02:35
  • 7
    BeautifulSoup is good when the markup is heavy, else try to avoid it as it's very slow. – Ethan Jun 12 '15 at 12:48
  • 1
    Great answer. You forgot the colon at the end of `def cleanhtml(raw_html)` though :) – bjesus Sep 26 '16 at 18:29
  • FWIW, this will also remove XML and other XHTML tags, too. – blacksite Jun 01 '17 at 19:11
  • If you want to remove specific tags only, you could either .extract() them to remove the tag and its content, or .unwrap() them to remove only the tag... There is no syntactic difference between an HTML tag and an XML tag. – c24b Jun 08 '17 at 13:43
  • 4
    Nice answer. You might want to explicitly set your parser in BeautifulSoup, using `cleantext = BeautifulSoup(raw_html, "html.parser").text` – Zemogle Dec 06 '17 at 16:32
  • what's the importance of compiling the pattern if we can just pass it as a string to re.sub() ? – MAltakrori Jun 04 '18 at 21:30
  • re.sub() is a regex function that requires a pattern to match every kind of tag. Behind the scenes, it compiles the pattern... – c24b Jun 06 '18 at 00:00
  • '<.*?>' can anyone explain what it will do. – Sargam Modak Aug 25 '18 at 13:27
  • `.` matches any character, `*` repeats it zero or more times, and `?` makes the repetition non-greedy, so it matches as few characters as possible. Basically, the expression concentrates on the characters that delimit a tag, i.e. `<` and `>`, and matches everything from a `<` up to the next `>`. – c24b Mar 04 '19 at 10:07
  • Compiling the regex is not needed since it is cached automatically. – Jakub Bláha Jul 10 '19 at 20:07
  • regex fail when use w3c standar, by example: `` or `` or `&lt; bad header bad footer &gt;` or `&lt;\s` or `&lt;%`, etc.</plaintext></span> –&nbsp;<a href="../../users/1243068/e-info128" title="3,727 reputation" class="comment-user ">e-info128</a> <span class="comment-date" dir="ltr"><a class="comment-link" href="../../questions/9662346/python-code-to-remove-html-tags-from-a-string#comment102667443_12982689"><span title="2019-09-27T19:46:59.010 License: CC BY-SA 4.0" class="relativetime-clean">Sep 27 '19 at 19:46</span></a></span> </div> </div> </li> <li id="comment-103007863" class="comment js-comment " data-comment-id="103007863" data-comment-owner-id="3746632" data-comment-score="1"> <div class="js-comment-actions comment-actions"> <div class="comment-score js-comment-edit-hide"> <span title="number of 'useful comment' votes received" class="warm">1</span> </div> </div> <div class="comment-text js-comment-text-and-form"> <a name="comment103007863_12982689"></a> <div class="comment-body js-comment-edit-hide"> <span class="comment-copy">The first half of this answer should be removed because it is terribly wrong to try this. HTML needs to be parsed as a tree and understood that `<script>` and other tags can contain anything. 
I say this with the politest regard and c24b acknowledged this.</script></span> –&nbsp;<a href="../../users/3746632/ldmtwo" title="419 reputation" class="comment-user ">ldmtwo</a> <span class="comment-date" dir="ltr"><a class="comment-link" href="../../questions/9662346/python-code-to-remove-html-tags-from-a-string#comment103007863_12982689"><span title="2019-10-10T13:56:49.293 License: CC BY-SA 4.0" class="relativetime-clean">Oct 10 '19 at 13:56</span></a></span> </div> </div> </li> <li id="comment-115750091" class="comment js-comment " data-comment-id="115750091" data-comment-owner-id="3947461" data-comment-score="0"> <div class="js-comment-actions comment-actions"> <div class="comment-score js-comment-edit-hide"> </div> </div> <div class="comment-text js-comment-text-and-form"> <a name="comment115750091_12982689"></a> <div class="comment-body js-comment-edit-hide"> <span class="comment-copy">"Couldn't find a tree builder with the features you requested: lxml"</span> –&nbsp;<a href="../../users/3947461/rodrigo-vieira" title="312 reputation" class="comment-user ">Rodrigo Vieira</a> <span class="comment-date" dir="ltr"><a class="comment-link" href="../../questions/9662346/python-code-to-remove-html-tags-from-a-string#comment115750091_12982689"><span title="2020-12-27T22:18:49.183 License: CC BY-SA 4.0" class="relativetime-clean">Dec 27 '20 at 22:18</span></a></span> </div> </div> </li> <li id="comment-116796764" class="comment js-comment " data-comment-id="116796764" data-comment-owner-id="658060" data-comment-score="0"> <div class="js-comment-actions comment-actions"> <div class="comment-score js-comment-edit-hide"> </div> </div> <div class="comment-text js-comment-text-and-form"> <a name="comment116796764_12982689"></a> <div class="comment-body js-comment-edit-hide"> <span class="comment-copy">Install python-lxml to find the tree builder</span> –&nbsp;<a href="../../users/658060/c24b" title="5,278 reputation" class="comment-user ">c24b</a> <span 
class="comment-date" dir="ltr"><a class="comment-link" href="../../questions/9662346/python-code-to-remove-html-tags-from-a-string#comment116796764_12982689"><span title="2021-02-05T09:24:35.093 License: CC BY-SA 4.0" class="relativetime-clean">Feb 05 '21 at 09:24</span></a></span> </div> </div> </li> </ul> </div> </div> </div> </div> <a name="9662410"></a> <div id="answer-9662410" class="answer " data-answerid="9662410" data-ownerid="779200" data-score="49" itemprop="suggestedAnswer" itemscope="" itemtype="https://schema.org/Answer"> <div class="post-layout"> <div class="votecell post-layout--left"> <div class="js-voting-container grid jc-center fd-column ai-stretch gs4 fc-black-200" data-post-id="9662410"> <button class="js-vote-up-btn grid--cell s-btn s-btn__unset c-pointer"><svg aria-hidden="true" class="m0 svg-icon iconArrowUpLg" width="36" height="36" viewBox="0 0 36 36"><path d="M2 26h32L18 10 2 26z"></path></svg></button> <div class="js-vote-count grid--cell fc-black-500 fs-title grid fd-column ai-center" itemprop="upvoteCount" data-value="49">49</div> </div> </div> <div class="postcell post-layout--right"> <div class="s-prose js-post-body" itemprop="text"><p>Python has several XML modules built in. 
The simplest one for the case that you already have a string with the full HTML is <a class="external-link" href="http://docs.python.org/library/xml.etree.elementtree.html" rel="noreferrer"><code>xml.etree</code></a>, which works (somewhat) similarly to the lxml example you mention:</p> <pre class="lang-py prettyprint-override"><code>def remove_tags(text): return ''.join(xml.etree.ElementTree.fromstring(text).itertext()) </code></pre></div> <div class="mb0"> <div class="mt16 grid gs8 gsy fw-wrap jc-end ai-start pt4 mb16"> <div class="grid--cell mr16 fl1 w96"></div> <div class="post-signature grid--cell"> <div class="s-user-card s-user-card"> <time class="s-user-card--time" datetime="edited Jun 06 '22 at 10:41">edited Jun 06 '22 at 10:41</time> <a href="../../users/12491345/pdaawr" class="s-avatar s-avatar__32 s-user-card--avatar"> <img class="s-avatar--image" src="../../users/profiles/12491345.webp" data-jdenticon-width="32" data-jdenticon-height="32" data-jdenticon-value="pdaawr" /> </a> <div class="s-user-card--info"> <a href="../../users/12491345/pdaawr" class="s-user-card--link">pdaawr</a> <ul class="s-user-card--awards"> <li class="s-user-card--rep" title="reputation score">436</li> <li class="s-award-bling s-award-bling__silver" title="7 silver badges">7</li> <li class="s-award-bling s-award-bling__bronze" title="16 bronze badges">16</li> </ul> </div> </div> </div> <div class="post-signature grid--cell"> <div class="s-user-card s-user-card"> <time class="s-user-card--time" datetime="answered Mar 12 '12 at 06:04">answered Mar 12 '12 at 06:04</time> <a href="../../users/779200/lvc" class="s-avatar s-avatar__32 s-user-card--avatar"> <img class="s-avatar--image" src="../../users/profiles/779200.webp" data-jdenticon-width="32" data-jdenticon-height="32" data-jdenticon-value="lvc" /> </a> <div class="s-user-card--info"> <a href="../../users/779200/lvc" class="s-user-card--link">lvc</a> <ul class="s-user-card--awards"> <li class="s-user-card--rep" title="reputation 
score">34,233</li> <li class="s-award-bling s-award-bling__gold" title="10 gold badges">10</li> <li class="s-award-bling s-award-bling__silver" title="73 silver badges">73</li> <li class="s-award-bling s-award-bling__bronze" title="98 bronze badges">98</li> </ul> </div> </div> </div> </div> </div> </div> <div class="post-layout--right js-post-comments-component"> <div id="comments-9662410" class="comments js-comments-container bt bc-black-075 mt12 " data-post-id="9662410" data-min-length="15"> <ul class="comments-list js-comments-list" data-remaining-comments-count="0" data-canpost="false" data-cansee="true" data-comments-unavailable="false" data-addlink-disabled="true"> <li id="comment-107277885" class="comment js-comment " data-comment-id="107277885" data-comment-owner-id="2398782" data-comment-score="3"> <div class="js-comment-actions comment-actions"> <div class="comment-score js-comment-edit-hide"> <span title="number of 'useful comment' votes received" class="warm">3</span> </div> </div> <div class="comment-text js-comment-text-and-form"> <a name="comment107277885_9662410"></a> <div class="comment-body js-comment-edit-hide"> <span class="comment-copy">This worked for me but be carefull of the html tags from autoclose type. Example : I got a "ParseError: mismatched tag: line 1, column 9" cause this tag is close without being open before. 
This is the same for all html tags autoclosed.</span> –&nbsp;<a href="../../users/2398782/1ronmat" title="1,147 reputation" class="comment-user ">1ronmat</a> <span class="comment-date" dir="ltr"><a class="comment-link" href="../../questions/9662346/python-code-to-remove-html-tags-from-a-string#comment107277885_9662410"><span title="2020-03-11T13:17:21.630 License: CC BY-SA 4.0" class="relativetime-clean">Mar 11 '20 at 13:17</span></a></span> </div> </div> </li> </ul> </div> </div> </div> </div> <a name="9662362"></a> <div id="answer-9662362" class="answer " data-answerid="9662362" data-ownerid="148870" data-score="40" itemprop="suggestedAnswer" itemscope="" itemtype="https://schema.org/Answer"> <div class="post-layout"> <div class="votecell post-layout--left"> <div class="js-voting-container grid jc-center fd-column ai-stretch gs4 fc-black-200" data-post-id="9662362"> <button class="js-vote-up-btn grid--cell s-btn s-btn__unset c-pointer"><svg aria-hidden="true" class="m0 svg-icon iconArrowUpLg" width="36" height="36" viewBox="0 0 36 36"><path d="M2 26h32L18 10 2 26z"></path></svg></button> <div class="js-vote-count grid--cell fc-black-500 fs-title grid fd-column ai-center" itemprop="upvoteCount" data-value="40">40</div> </div> </div> <div class="postcell post-layout--right"> <div class="s-prose js-post-body" itemprop="text"><p>Note that this isn't perfect, since if you had something like, say, <code>&lt;a title="&gt;"&gt;</code> it would break. 
However, it's about the closest you'd get in non-library Python without a really complex function:</p> <pre><code>import re TAG_RE = re.compile(r'&lt;[^&gt;]+&gt;') def remove_tags(text): return TAG_RE.sub('', text) </code></pre> <p>However, as lvc mentions <a class="external-link" href="http://docs.python.org/library/xml.etree.elementtree.html" rel="noreferrer"><code>xml.etree</code></a> is available in the Python Standard Library, so you could probably just adapt it to serve like your existing <code>lxml</code> version:</p> <pre><code>def remove_tags(text): return ''.join(xml.etree.ElementTree.fromstring(text).itertext()) </code></pre></div> <div class="mb0"> <div class="mt16 grid gs8 gsy fw-wrap jc-end ai-start pt4 mb16"> <div class="grid--cell mr16 fl1 w96"></div> <div class="post-signature grid--cell"> <div class="s-user-card s-user-card"> <time class="s-user-card--time" datetime="edited Oct 16 '17 at 15:59">edited Oct 16 '17 at 15:59</time> <a href="../../users/464744/blender" class="s-avatar s-avatar__32 s-user-card--avatar"> <img class="s-avatar--image" src="../../users/profiles/464744.webp" data-jdenticon-width="32" data-jdenticon-height="32" data-jdenticon-value="Blender" /> </a> <div class="s-user-card--info"> <a href="../../users/464744/blender" class="s-user-card--link">Blender</a> <ul class="s-user-card--awards"> <li class="s-user-card--rep" title="reputation score">289,723</li> <li class="s-award-bling s-award-bling__gold" title="53 gold badges">53</li> <li class="s-award-bling s-award-bling__silver" title="439 silver badges">439</li> <li class="s-award-bling s-award-bling__bronze" title="496 bronze badges">496</li> </ul> </div> </div> </div> <div class="post-signature grid--cell"> <div class="s-user-card s-user-card"> <time class="s-user-card--time" datetime="answered Mar 12 '12 at 05:57">answered Mar 12 '12 at 05:57</time> <a href="../../users/148870/amber" class="s-avatar s-avatar__32 s-user-card--avatar"> <img class="s-avatar--image" 
src="../../users/profiles/148870.webp" data-jdenticon-width="32" data-jdenticon-height="32" data-jdenticon-value="Amber" /> </a> <div class="s-user-card--info"> <a href="../../users/148870/amber" class="s-user-card--link">Amber</a> <ul class="s-user-card--awards"> <li class="s-user-card--rep" title="reputation score">507,862</li> <li class="s-award-bling s-award-bling__gold" title="82 gold badges">82</li> <li class="s-award-bling s-award-bling__silver" title="626 silver badges">626</li> <li class="s-award-bling s-award-bling__bronze" title="550 bronze badges">550</li> </ul> </div> </div> </div> </div> </div> </div> <div class="post-layout--right js-post-comments-component"> <div id="comments-9662362" class="comments js-comments-container bt bc-black-075 mt12 " data-post-id="9662362" data-min-length="15"> <ul class="comments-list js-comments-list" data-remaining-comments-count="0" data-canpost="false" data-cansee="true" data-comments-unavailable="false" data-addlink-disabled="true"> <li id="comment-12271984" class="comment js-comment " data-comment-id="12271984" data-comment-owner-id="361638" data-comment-score="2"> <div class="js-comment-actions comment-actions"> <div class="comment-score js-comment-edit-hide"> <span title="number of 'useful comment' votes received" class="warm">2</span> </div> </div> <div class="comment-text js-comment-text-and-form"> <a name="comment12271984_9662362"></a> <div class="comment-body js-comment-edit-hide"> <span class="comment-copy">I like your regex approach, maybe it will be better if performance's an important factor.</span> –&nbsp;<a href="../../users/361638/douglas-camata" title="598 reputation" class="comment-user ">Douglas Camata</a> <span class="comment-date" dir="ltr"><a class="comment-link" href="../../questions/9662346/python-code-to-remove-html-tags-from-a-string#comment12271984_9662362"><span title="2012-03-12T06:27:45.770 License: CC BY-SA 3.0" class="relativetime-clean">Mar 12 '12 at 06:27</span></a></span> </div> 
</div> </li> <li id="comment-39180606" class="comment js-comment " data-comment-id="39180606" data-comment-owner-id="1919237" data-comment-score="0"> <div class="js-comment-actions comment-actions"> <div class="comment-score js-comment-edit-hide"> </div> </div> <div class="comment-text js-comment-text-and-form"> <a name="comment39180606_9662362"></a> <div class="comment-body js-comment-edit-hide"> <span class="comment-copy">And in addition, it works with strings not starting with an xml tag, it that would be the case</span> –&nbsp;<a href="../../users/1919237/kiril" title="4,914 reputation" class="comment-user ">kiril</a> <span class="comment-date" dir="ltr"><a class="comment-link" href="../../questions/9662346/python-code-to-remove-html-tags-from-a-string#comment39180606_9662362"><span title="2014-08-06T16:41:17.213 License: CC BY-SA 3.0" class="relativetime-clean">Aug 06 '14 at 16:41</span></a></span> </div> </div> </li> <li id="comment-45504820" class="comment js-comment " data-comment-id="45504820" data-comment-owner-id="1287834" data-comment-score="0"> <div class="js-comment-actions comment-actions"> <div class="comment-score js-comment-edit-hide"> </div> </div> <div class="comment-text js-comment-text-and-form"> <a name="comment45504820_9662362"></a> <div class="comment-body js-comment-edit-hide"> <span class="comment-copy">@DouglasCamata regex is not more performant than an xml parser.</span> –&nbsp;<a href="../../users/1287834/slater-victoroff" title="21,376 reputation" class="comment-user ">Slater Victoroff</a> <span class="comment-date" dir="ltr"><a class="comment-link" href="../../questions/9662346/python-code-to-remove-html-tags-from-a-string#comment45504820_9662362"><span title="2015-02-19T05:43:10.293 License: CC BY-SA 3.0" class="relativetime-clean">Feb 19 '15 at 05:43</span></a></span> </div> </div> </li> <li id="comment-45745565" class="comment js-comment " data-comment-id="45745565" data-comment-owner-id="361638" data-comment-score="0"> <div 
class="js-comment-actions comment-actions"> <div class="comment-score js-comment-edit-hide"> </div> </div> <div class="comment-text js-comment-text-and-form"> <a name="comment45745565_9662362"></a> <div class="comment-body js-comment-edit-hide"> <span class="comment-copy">@SlaterTyranus that depends on the xml parser and regex implementation. I guess both use C extensions... but do you have any benchmarks for us to see?</span> –&nbsp;<a href="../../users/361638/douglas-camata" title="598 reputation" class="comment-user ">Douglas Camata</a> <span class="comment-date" dir="ltr"><a class="comment-link" href="../../questions/9662346/python-code-to-remove-html-tags-from-a-string#comment45745565_9662362"><span title="2015-02-25T21:11:52.013 License: CC BY-SA 3.0" class="relativetime-clean">Feb 25 '15 at 21:11</span></a></span> </div> </div> </li> <li id="comment-45772891" class="comment js-comment " data-comment-id="45772891" data-comment-owner-id="1287834" data-comment-score="0"> <div class="js-comment-actions comment-actions"> <div class="comment-score js-comment-edit-hide"> </div> </div> <div class="comment-text js-comment-text-and-form"> <a name="comment45772891_9662362"></a> <div class="comment-body js-comment-edit-hide"> <span class="comment-copy">@DouglasCamata It's hard because the two aren't really comparable. 
It's easy to come up with toy examples where `regex` will outperform a real parser, but these lxml benchmarks are a good indicator of real-world performance http://www.ibm.com/developerworks/library/x-hiperfparse/</span> –&nbsp;<a href="../../users/1287834/slater-victoroff" title="21,376 reputation" class="comment-user ">Slater Victoroff</a> <span class="comment-date" dir="ltr"><a class="comment-link" href="../../questions/9662346/python-code-to-remove-html-tags-from-a-string#comment45772891_9662362"><span title="2015-02-26T14:37:42.857 License: CC BY-SA 3.0" class="relativetime-clean">Feb 26 '15 at 14:37</span></a></span> </div> </div> </li> <li id="comment-45772965" class="comment js-comment " data-comment-id="45772965" data-comment-owner-id="1287834" data-comment-score="3"> <div class="js-comment-actions comment-actions"> <div class="comment-score js-comment-edit-hide"> <span title="number of 'useful comment' votes received" class="warm">3</span> </div> </div> <div class="comment-text js-comment-text-and-form"> <a name="comment45772965_9662362"></a> <div class="comment-body js-comment-edit-hide"> <span class="comment-copy">It's worth noting that this will break if you have a text `&lt;` in your document.</span> –&nbsp;<a href="../../users/1287834/slater-victoroff" title="21,376 reputation" class="comment-user ">Slater Victoroff</a> <span class="comment-date" dir="ltr"><a class="comment-link" href="../../questions/9662346/python-code-to-remove-html-tags-from-a-string#comment45772965_9662362"><span title="2015-02-26T14:39:22.747 License: CC BY-SA 3.0" class="relativetime-clean">Feb 26 '15 at 14:39</span></a></span> </div> </div> </li> <li id="comment-109117934" class="comment js-comment " data-comment-id="109117934" data-comment-owner-id="1457380" data-comment-score="0"> <div class="js-comment-actions comment-actions"> <div class="comment-score js-comment-edit-hide"> </div> </div> <div class="comment-text js-comment-text-and-form"> <a 
name="comment109117934_9662362"></a> <div class="comment-body js-comment-edit-hide"> <span class="comment-copy">For the `xml.etree` solution, I get `NameError: name 'xml' is not defined`, any ideas?</span> –&nbsp;<a href="../../users/1457380/patrickt" title="10,037 reputation" class="comment-user ">PatrickT</a> <span class="comment-date" dir="ltr"><a class="comment-link" href="../../questions/9662346/python-code-to-remove-html-tags-from-a-string#comment109117934_9662362"><span title="2020-05-08T22:00:55.933 License: CC BY-SA 4.0" class="relativetime-clean">May 08 '20 at 22:00</span></a></span> </div> </div> </li> <li id="comment-109120923" class="comment js-comment " data-comment-id="109120923" data-comment-owner-id="148870" data-comment-score="1"> <div class="js-comment-actions comment-actions"> <div class="comment-score js-comment-edit-hide"> <span title="number of 'useful comment' votes received" class="warm">1</span> </div> </div> <div class="comment-text js-comment-text-and-form"> <a name="comment109120923_9662362"></a> <div class="comment-body js-comment-edit-hide"> <span class="comment-copy">@PatrickT you need to export it - `import xml.etree`</span> –&nbsp;<a href="../../users/148870/amber" title="507,862 reputation" class="comment-user ">Amber</a> <span class="comment-date" dir="ltr"><a class="comment-link" href="../../questions/9662346/python-code-to-remove-html-tags-from-a-string#comment109120923_9662362"><span title="2020-05-09T01:19:06.867 License: CC BY-SA 4.0" class="relativetime-clean">May 09 '20 at 01:19</span></a></span> </div> </div> </li> <li id="comment-109121789" class="comment js-comment " data-comment-id="109121789" data-comment-owner-id="1457380" data-comment-score="0"> <div class="js-comment-actions comment-actions"> <div class="comment-score js-comment-edit-hide"> </div> </div> <div class="comment-text js-comment-text-and-form"> <a name="comment109121789_9662362"></a> <div class="comment-body js-comment-edit-hide"> <span 
class="comment-copy">@Amber, thanks! I misunderstood the meaning of "available in the Python Standard Library" as always available.</span> –&nbsp;<a href="../../users/1457380/patrickt" title="10,037 reputation" class="comment-user ">PatrickT</a> <span class="comment-date" dir="ltr"><a class="comment-link" href="../../questions/9662346/python-code-to-remove-html-tags-from-a-string#comment109121789_9662362"><span title="2020-05-09T02:37:17.307 License: CC BY-SA 4.0" class="relativetime-clean">May 09 '20 at 02:37</span></a></span> </div> </div> </li> <li id="comment-112786571" class="comment js-comment " data-comment-id="112786571" data-comment-owner-id="5729119" data-comment-score="0"> <div class="js-comment-actions comment-actions"> <div class="comment-score js-comment-edit-hide"> </div> </div> <div class="comment-text js-comment-text-and-form"> <a name="comment112786571_9662362"></a> <div class="comment-body js-comment-edit-hide"> <span class="comment-copy">a slight improvement to the original answer can be `re.sub(&lt;[^&lt;]+?&gt;, '', text)`</span> –&nbsp;<a href="../../users/5729119/user54211" title="121 reputation" class="comment-user ">User54211</a> <span class="comment-date" dir="ltr"><a class="comment-link" href="../../questions/9662346/python-code-to-remove-html-tags-from-a-string#comment112786571_9662362"><span title="2020-09-07T17:00:05.537 License: CC BY-SA 4.0" class="relativetime-clean">Sep 07 '20 at 17:00</span></a></span> </div> </div> </li> </ul> </div> </div> </div> </div> <a name="14464496"></a> <div id="answer-14464496" class="answer " data-answerid="14464496" data-ownerid="1402286" data-score="9" itemprop="suggestedAnswer" itemscope="" itemtype="https://schema.org/Answer"> <div class="post-layout"> <div class="votecell post-layout--left"> <div class="js-voting-container grid jc-center fd-column ai-stretch gs4 fc-black-200" data-post-id="14464496"> <button class="js-vote-up-btn grid--cell s-btn s-btn__unset c-pointer"><svg aria-hidden="true" 
class="m0 svg-icon iconArrowUpLg" width="36" height="36" viewBox="0 0 36 36"><path d="M2 26h32L18 10 2 26z"></path></svg></button> <div class="js-vote-count grid--cell fc-black-500 fs-title grid fd-column ai-center" itemprop="upvoteCount" data-value="9">9</div> </div> </div> <div class="postcell post-layout--right"> <div class="s-prose js-post-body" itemprop="text"><p>There's a simple way to this in any C-like language. The style is not Pythonic but works with pure Python:</p> <pre><code>def remove_html_markup(s): tag = False quote = False out = "" for c in s: if c == '&lt;' and not quote: tag = True elif c == '&gt;' and not quote: tag = False elif (c == '"' or c == "'") and tag: quote = not quote elif not tag: out = out + c return out </code></pre> <p>The idea based in a simple finite-state machine and is detailed explained here: <a class="external-link" href="http://youtu.be/2tu9LTDujbw" rel="noreferrer">http://youtu.be/2tu9LTDujbw</a></p> <p>You can see it working here: <a class="external-link" href="http://youtu.be/HPkNPcYed9M?t=35s" rel="noreferrer">http://youtu.be/HPkNPcYed9M?t=35s</a></p> <p>PS - If you're interested in the class(about smart debugging with python) I give you a link: <a class="external-link" href="https://www.udacity.com/course/software-debugging--cs259" rel="noreferrer">https://www.udacity.com/course/software-debugging--cs259</a>. It's free! 
</p></div> <div class="mb0"> <div class="mt16 grid gs8 gsy fw-wrap jc-end ai-start pt4 mb16"> <div class="grid--cell mr16 fl1 w96"></div> <div class="post-signature grid--cell"> <div class="user-info "> <div class="user-action-time">edited <span title="2019-10-10T20:43:35.063" class="relativetime">Oct 10 '19 at 20:43</span></div> <div class="user-gravatar32"></div> <div class="user-details" itemprop="author" itemscope="" itemtype="http://schema.org/Person"> <span class="d-none" itemprop="name">Igor Medeiros</span> <div class="-flair"></div> </div> </div> </div> <div class="post-signature grid--cell"> <div class="s-user-card s-user-card"> <time class="s-user-card--time" datetime="answered Jan 22 '13 at 17:27">answered Jan 22 '13 at 17:27</time> <a href="../../users/1402286/igor-medeiros" class="s-avatar s-avatar__32 s-user-card--avatar"> <img class="s-avatar--image" src="../../users/profiles/1402286.webp" data-jdenticon-width="32" data-jdenticon-height="32" data-jdenticon-value="Igor Medeiros" /> </a> <div class="s-user-card--info"> <a href="../../users/1402286/igor-medeiros" class="s-user-card--link">Igor Medeiros</a> <ul class="s-user-card--awards"> <li class="s-user-card--rep" title="reputation score">4,026</li> <li class="s-award-bling s-award-bling__gold" title="2 gold badges">2</li> <li class="s-award-bling s-award-bling__silver" title="26 silver badges">26</li> <li class="s-award-bling s-award-bling__bronze" title="32 bronze badges">32</li> </ul> </div> </div> </div> </div> </div> </div> <div class="post-layout--right js-post-comments-component"> <div id="comments-14464496" class="comments js-comments-container bt bc-black-075 mt12 " data-post-id="14464496" data-min-length="15"> <ul class="comments-list js-comments-list" data-remaining-comments-count="0" data-canpost="false" data-cansee="true" data-comments-unavailable="false" data-addlink-disabled="true"> <li id="comment-33510473" class="comment js-comment " data-comment-id="33510473" 
data-comment-owner-id="1338797" data-comment-score="8"> <div class="js-comment-actions comment-actions"> <div class="comment-score js-comment-edit-hide"> <span title="number of 'useful comment' votes received" class="warm">8</span> </div> </div> <div class="comment-text js-comment-text-and-form"> <a name="comment33510473_14464496"></a> <div class="comment-body js-comment-edit-hide"> <span class="comment-copy">This will break on mismatched quotes, and is quite slow due to adding to the output character by character. But it ilustrates enough, that writing a primitive character-by-character parser isn't a big deal.</span> –&nbsp;<a href="../../users/1338797/tomasz-gandor" title="8,235 reputation" class="comment-user ">Tomasz Gandor</a> <span class="comment-date" dir="ltr"><a class="comment-link" href="../../questions/9662346/python-code-to-remove-html-tags-from-a-string#comment33510473_14464496"><span title="2014-02-28T11:28:20.367 License: CC BY-SA 3.0" class="relativetime-clean">Feb 28 '14 at 11:28</span></a></span> </div> </div> </li> <li id="comment-110991848" class="comment js-comment " data-comment-id="110991848" data-comment-owner-id="712526" data-comment-score="1"> <div class="js-comment-actions comment-actions"> <div class="comment-score js-comment-edit-hide"> <span title="number of 'useful comment' votes received" class="warm">1</span> </div> </div> <div class="comment-text js-comment-text-and-form"> <a name="comment110991848_14464496"></a> <div class="comment-body js-comment-edit-hide"> <span class="comment-copy">This answer is great for teaching HTML or Python, but misses a crucial point for production use: meeting standards is hard, and using a well-supported library can avoid weeks of research and/or bug-hunting in an otherwise healthy deadline.</span> –&nbsp;<a href="../../users/712526/jpaugh" title="6,634 reputation" class="comment-user ">jpaugh</a> <span class="comment-date" dir="ltr"><a class="comment-link" 
href="../../questions/9662346/python-code-to-remove-html-tags-from-a-string#comment110991848_14464496"><span title="2020-07-06T20:44:26.397 License: CC BY-SA 4.0" class="relativetime-clean">Jul 06 '20 at 20:44</span></a></span> </div> </div> </li> </ul> </div> </div> </div> </div> <a name="15063976"></a> <div id="answer-15063976" class="answer " data-answerid="15063976" data-ownerid="1899895" data-score="-13" itemprop="suggestedAnswer" itemscope="" itemtype="https://schema.org/Answer"> <div class="post-layout"> <div class="votecell post-layout--left"> <div class="js-voting-container grid jc-center fd-column ai-stretch gs4 fc-black-200" data-post-id="15063976"> <button class="js-vote-up-btn grid--cell s-btn s-btn__unset c-pointer"><svg aria-hidden="true" class="m0 svg-icon iconArrowUpLg" width="36" height="36" viewBox="0 0 36 36"><path d="M2 26h32L18 10 2 26z"></path></svg></button> <div class="js-vote-count grid--cell fc-black-500 fs-title grid fd-column ai-center" itemprop="upvoteCount" data-value="-13">-13</div> </div> </div> <div class="postcell post-layout--right"> <div class="s-prose js-post-body" itemprop="text"><pre><code>global temp temp ='' s = ' ' def remove_strings(text): global temp if text == '': return temp start = text.find('&lt;') end = text.find('&gt;') if start == -1 and end == -1 : temp = temp + text return temp newstring = text[end+1:] fresh_start = newstring.find('&lt;') if newstring[:fresh_start] != '': temp += s+newstring[:fresh_start] remove_strings(newstring[fresh_start:]) return temp </code></pre></div> <div class="mb0"> <div class="mt16 grid gs8 gsy fw-wrap jc-end ai-start pt4 mb16"> <div class="grid--cell mr16 fl1 w96"></div> <div class="post-signature grid--cell"> <div class="s-user-card s-user-card"> <time class="s-user-card--time" datetime="edited Aug 14 '14 at 13:29">edited Aug 14 '14 at 13:29</time> <a href="../../users/237258/drachenfels" class="s-avatar s-avatar__32 s-user-card--avatar"> <img class="s-avatar--image" 
src="../../users/profiles/237258.webp" data-jdenticon-width="32" data-jdenticon-height="32" data-jdenticon-value="Drachenfels" /> </a> <div class="s-user-card--info"> <a href="../../users/237258/drachenfels" class="s-user-card--link">Drachenfels</a> <ul class="s-user-card--awards"> <li class="s-user-card--rep" title="reputation score">3,037</li> <li class="s-award-bling s-award-bling__gold" title="2 gold badges">2</li> <li class="s-award-bling s-award-bling__silver" title="32 silver badges">32</li> <li class="s-award-bling s-award-bling__bronze" title="47 bronze badges">47</li> </ul> </div> </div> </div> <div class="post-signature grid--cell"> <div class="s-user-card s-user-card"> <time class="s-user-card--time" datetime="answered Feb 25 '13 at 09:39">answered Feb 25 '13 at 09:39</time> <a href="../../users/1899895/user1899895" class="s-avatar s-avatar__32 s-user-card--avatar"> <img class="s-avatar--image" src="../../users/profiles/1899895.webp" data-jdenticon-width="32" data-jdenticon-height="32" data-jdenticon-value="user1899895" /> </a> <div class="s-user-card--info"> <a href="../../users/1899895/user1899895" class="s-user-card--link">user1899895</a> <ul class="s-user-card--awards"> <li class="s-user-card--rep" title="reputation score">67</li> <li class="s-award-bling s-award-bling__silver" title="1 silver badges">1</li> <li class="s-award-bling s-award-bling__bronze" title="5 bronze badges">5</li> </ul> </div> </div> </div> </div> </div> </div> <div class="post-layout--right js-post-comments-component"> <div id="comments-15063976" class="comments js-comments-container bt bc-black-075 mt12 " data-post-id="15063976" data-min-length="15"> <ul class="comments-list js-comments-list" data-remaining-comments-count="0" data-canpost="false" data-cansee="true" data-comments-unavailable="false" data-addlink-disabled="true"> <li id="comment-39448485" class="comment js-comment " data-comment-id="39448485" data-comment-owner-id="237258" data-comment-score="29"> <div 
class="js-comment-actions comment-actions"> <div class="comment-score js-comment-edit-hide"> <span title="number of 'useful comment' votes received" class="warm">29</span> </div> </div> <div class="comment-text js-comment-text-and-form"> <a name="comment39448485_15063976"></a> <div class="comment-body js-comment-edit-hide"> <span class="comment-copy">Your answer is: a) awfully formatted (violates PEP 8, for example), b) overkill because there are tools that do the same, c) prone to fail (what happens when the HTML has a &gt; character in one of its attributes?), d) a global in the 21st century, in such a trivial case?</span> –&nbsp;<a href="../../users/237258/drachenfels" title="3,037 reputation" class="comment-user ">Drachenfels</a> <span class="comment-date" dir="ltr"><a class="comment-link" href="../../questions/9662346/python-code-to-remove-html-tags-from-a-string#comment39448485_15063976"><span title="2014-08-14T13:27:34.860 License: CC BY-SA 3.0" class="relativetime-clean">Aug 14 '14 at 13:27</span></a></span> </div> </div> </li> </ul> </div> </div> </div> </div> </div> </div> </div> </div> <script src="../../static/js/stack-icons.js"></script> <script src="../../static/js/fromnow.js"></script> </body> </html>