I have a regex which scans through HTML strings and removes all tags but leaves the inner text content. The regex is as follows:

<([^>"]*|"[^"]*")*>

I have tested it thoroughly on http://regexr.com/ and it behaves the way I want it to. But when I use

string ret = Regex.Replace(HTML, @"<([^>""]*|""[^""]*"")*>", "");

I'm still getting href tags in my string. I'm not sure what the difference is between the regex in C# and the one on regexr.com.

P.S. No, I cannot use HtmlAgilityPack, because it overflows the stack when the HTML elements are nested too deeply, which happens frequently.

Example to test with: `

<description>
<!-- SC_OFF --><div class="md"><p>I was reading this comment threat about the upcoming Martian announcement. <a href="https://www.reddit.com/r/space/comments/3mjrdt/nasa_to_confirm_active_briny_water_flows_on_mars/cvfq50q">This comment</a> got me wondering.</p> <p>If you were in a decompression chamber and gradually decompressed (to avoid the bends), could you walk out onto the Martian surface with just an oxygen tank, provided that the surface was experiencing those balmy summer temperatures mentioned in the comment?</p> <p>I read The Martian recently, and I was thinking this possibility could have changed the whole book.</p> </div><!-- SC_ON --> submitted by <a href="https://www.reddit.com/user/jackwreid"> jackwreid </a> to <a href="https://www.reddit.com/r/askscience/"> askscience</a> <br/> <a href="https://www.reddit.com/r/askscience/comments/3mkson/given_time_to_decompress_slowly_could_a_human/">[link]</a> <a href="https://www.reddit.com/r/askscience/comments/3mkson/given_time_to_decompress_slowly_could_a_human/">[449 comments]</a>
</description>`

regexr.com correctly matches the tags (everything that is not inner text), but my C# code gives me the following, with none of those elements removed:

`

Given time to decompress slowly, could a human
survive in a Martian summer with just a oxygen mask?asksciencehttps://www.reddit
.com/r/askscience/comments/3mkson/given_time_to_decompress_slowly_could_a_human/
https://www.reddit.com/r/askscience/comments/3mkson/given_time_to_decompress_slo
wly_could_a_human/Sun, 27 Sep 2015 14:02:23 +0000&lt;!-- SC_OFF --&gt;&lt;div cl
ass=&#34;md&#34;&gt;&lt;p&gt;I was reading this comment threat about the upcomin
g Martian announcement. &lt;a href=&#34;https://www.reddit.com/r/space/comments/
3mjrdt/nasa_to_confirm_active_briny_water_flows_on_mars/cvfq50q&#34;&gt;This com
ment&lt;/a&gt; got me wondering.&lt;/p&gt; &lt;p&gt;If you were in a decompressi
on chamber and gradually decompressed (to avoid the bends), could you walk out o
nto the Martian surface with just an oxygen tank, provided that the surface was
experiencing those balmy summer temperatures mentioned in the comment?&lt;/p&gt;
 &lt;p&gt;I read The Martian recently, and I was thinking this possibility could
 have changed the whole book.&lt;/p&gt; &lt;/div&gt;&lt;!-- SC_ON --&gt; submitt
ed by &lt;a href=&#34;https://www.reddit.com/user/jackwreid&#34;&gt; jackwreid &
lt;/a&gt; to &lt;a href=&#34;https://www.reddit.com/r/askscience/&#34;&gt; asksc
ience&lt;/a&gt; &lt;br/&gt; &lt;a href=&#34;https://www.reddit.com/r/askscience/
comments/3mkson/given_time_to_decompress_slowly_could_a_human/&#34;&gt;[link]&lt
;/a&gt; &lt;a href="https://www.reddit.com/r/askscience/comments/3mkson/given_ti
me_to_decompress_slowly_could_a_human/"&gt;[449 comments]

`

  • Could you give an example of simple wanted source --> result to validate the Regex ? – Joel Bourbonnais Sep 27 '15 at 21:41
  • 2
    This is why we don't use regex to parse html. It can never work. Use an html parser. – spender Sep 27 '15 at 21:44
  • 1
    You _cannot_ use Regex to parse HTML. Read the duplicate question / answer. http://stackoverflow.com/a/1732454/682404 – xxbbcc Sep 27 '15 at 21:48
  • You guys are coming to hasty conclusions, what I am trying to do is clearly possible, try the example on regexr. It looks like the issue is because the html is coming in as &lt;a href...&gt; instead of <a href...> – David Parker Sep 27 '15 at 21:56
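
The C# output above contains `&lt;`, `&gt;` and `&#34;` rather than literal `<`, `>` and `"`, so the pattern probably never sees an angle bracket to match. Here is a minimal sketch of one way to handle that, assuming the input really is entity-encoded when it reaches `Regex.Replace`: decode it first with `System.Net.WebUtility.HtmlDecode`, then apply the same pattern. The class name `HtmlText` and the method `StripTags` are hypothetical names used only for illustration.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class HtmlText
{
    // Same pattern as in the question, written as a C# verbatim string
    // (a doubled "" stands for one literal quote character).
    private static readonly Regex TagPattern =
        new Regex(@"<([^>""]*|""[^""]*"")*>");

    // Hypothetical helper: turn entities such as &lt; and &#34; back into
    // literal characters, then remove everything the tag pattern matches.
    public static string StripTags(string html)
    {
        string decoded = WebUtility.HtmlDecode(html);
        return TagPattern.Replace(decoded, "");
    }
}
```

Calling `HtmlText.StripTags(HTML)` in place of the original `Regex.Replace` call should then strip the tags from both encoded and already-literal markup. One caveat with this order of operations: decoding first also turns any entity-escaped `<` in the visible text into a literal one, which the pattern may then treat as the start of a tag.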
