1

I have a piece of HTML from which I want to strip some style parts. I know I need a regex, but I don't know how to write it, or how to apply it in my C# code. Below is a sample of the original string:

<p style="color: #000000; text-transform: none; letter-spacing: normal; font-size: 12px; font-style: normal; font-variant: normal; font-weight: normal; word-spacing: 0px; white-space: normal; font-size-adjust: none; font-stretch: normal; -webkit-text-stroke-width: 0px;">

And here is the output that I wish to get after a replace operation:

<p> 

I want to get rid of the style attribute, and I need to do this for all occurrences of <p ...>.

There are tons of examples for this kind of job, but I got really confused by them. So any clue towards a solution would be great. Thanks in advance.

Tolga Evcimen
  • Check the accepted answer: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags - consider using an HTML parser rather than regex – Jamiec Aug 14 '13 at 09:30

2 Answers

3

You should really find a regex tutorial (example) to learn how matches work; replacements will be easier after that...

string output = Regex.Replace(input, @"(?<=<p)[^>]+", "");

See demo.

To remove only the style attribute, you could perhaps use this:

string output = Regex.Replace(input, @"(?<=<p)\s*style=""[^""]+""", "");

Note that this only works if the style attribute comes immediately after the <p (with any number of spaces in between).

Updated demo.


To remove the style attribute anywhere in the <p> tag, you can perhaps use this (maybe a bit safer than the previous one):

string output = Regex.Replace(input, @"(?<=<p)([^>]*?)\s*style=""[^"">]+""", "$1");

Reupdated demo.
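
Put together in a small console program, the patterns above might be applied as follows. This is only a sketch: the class and method names (ParagraphCleaner, StripAllAttributes, StripStyle) are made up for illustration, and the `\b` after `<p` is an addition that is not in the patterns above, assumed here so that tags such as <pre> are left alone. Like any regex approach, it assumes attribute values never contain a `>`.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ParagraphCleaner
{
    // Removes every attribute from each <p ...> tag.
    // The \b (an assumption, not in the original pattern) stops the
    // lookbehind from also matching tags such as <pre>.
    public static string StripAllAttributes(string html) =>
        Regex.Replace(html, @"(?<=<p\b)[^>]+", "");

    // Removes only style="..." from each <p ...> tag; whatever precedes it
    // inside the tag is captured in group 1 and put back via "$1".
    public static string StripStyle(string html) =>
        Regex.Replace(html, @"(?<=<p\b)([^>]*?)\s*style=""[^"">]+""", "$1");

    public static void Main()
    {
        string input =
            "<p class=\"intro\" style=\"color: #000000; font-size: 12px;\">Hello</p>" +
            "<pre style=\"x\">code</pre>";

        Console.WriteLine(StripAllAttributes(input));
        // prints: <p>Hello</p><pre style="x">code</pre>

        Console.WriteLine(StripStyle(input));
        // prints: <p class="intro">Hello</p><pre style="x">code</pre>
    }
}
```

Regex.Replace already replaces every match in the input, so no loop is needed to cover all <p ...> occurrences.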

Jerry
  • This is a good start, but it wipes out *every* attribute, not just the style attribute. Plug `` into it – Jamiec Aug 14 '13 at 09:42
  • For now I just need to get rid of all attributes, but sure, it would be good to learn how to remove specific attributes. – Tolga Evcimen Aug 14 '13 at 10:05
  • @TolgaEvcimen Sorry for taking some more time to come up with one regex that removes the style attribute anywhere in the p tag. I just updated the regex to be able to do this. – Jerry Aug 14 '13 at 10:33
  • I already put your previous regex into action, and it works fine right now. I'll look into the tutorial that you suggested. I hope I won't be posting questions about regex matching again, thanks to you :) – Tolga Evcimen Aug 14 '13 at 11:08
  • @TolgaEvcimen It's okay to post regex questions as long as there's an attempt and/or effort shown. And it's usually best to state what should happen and what shouldn't happen, to make sure everyone understands your question :) You're welcome! – Jerry Aug 14 '13 at 11:12
0

Not sure how to do it in C#, but as a general example using sed in bash, I would do:

echo "$input" | sed -r 's/(<p)[^>]*(>)/\1\2/g'

Where:

(<p)  ----- Captures the opening bracket with p
[^>]* ----- Anything in between, up to the next ">"
(>)   ----- Captures the closing bracket
\1\2  ----- Gives you back the two captured things,
            in this order, with nothing in between
g     ----- Repeats the replacement for every match on the line

Hope it helps, but again, you'll need to look up how to do the replacement in C# yourself.
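
For reference, the same capture-group idea could be sketched in C# roughly like this. The class and method names are hypothetical, `$1$2` is the .NET counterpart of sed's `\1\2`, and the pattern uses `[^>]*` plus a `\b` after `<p` (both assumptions made here) so that a match stops at the first ">" and tags such as <pre> are not touched.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SedStyleReplace
{
    // Captures "<p" and ">" and drops everything between them.
    public static string StripAttributes(string html) =>
        Regex.Replace(html, @"(<p\b)[^>]*(>)", "$1$2");

    public static void Main()
    {
        string input = "<p style=\"color: red;\">one</p><p id=\"b\">two</p>";
        Console.WriteLine(StripAttributes(input));
        // prints: <p>one</p><p>two</p>
    }
}
```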

Jamiec
Juto