
Hey everyone, I'm off on another coding adventure. I started teaching myself some basic RegEx earlier today and made a little C# app that takes an HTML file and a listbox of RegExes, then uses those RegExes to replace or remove HTML tags. I managed to write some working RegExes to clean up and remove the tags littering the tables, but I also need to remove the mess of hard-coded CSS styles and replace them with references to external ones. After a lot of trial and error, I finally came up with something that matches from <style type="text/css"> to </style>, but for some reason it skips right over the boundaries between separate style blocks and only stops at the closing tag of the last one. This is more of a curiosity than a pressing need; it works fine for now because I can just replace what is matched with a single <link> to the external CSS. As of right now, my RegEx is this:

<style((\s+\w+(\s*=\s*(?:".*?"|'.*?'|[^'">\s]+))?)+\s*|\s*)>(.*?\r\n)*(</style>)

The first half was taken from here; the middle bit was what I struggled with most, as I had forgotten about \r\n, and the closing tag is of course verbatim.

Like I said, this works fine. My only qualm is that for this code:

<style type="text/css">
<!--
#wrapper #content #main2col .modbox tr td {
    color: #3366cc;
    border-top-style: solid;
    border-right-style: solid;
    border-bottom-style: solid;
    border-left-style: solid;
}
#wrapper #content #main2col .modbox tr td p em {
    color: #0a304e;
}
#wrapper #content #main2col .modbox tr td em br {
    color: #0a304e;
}
#wrapper #content #main2col .modbox tr td em strong {
    color: #0a304e;
}
#wrapper #content #main2col p strong {
    color: #0a304e;
}
#wrapper #content #main2col table tr td strong {
    color: #0a304e;
}
-->
</style>
<style type="text/css">
<!--
table.modbox {
    font-size:9pt;
    font-family:"Calibri", "sans-serif";
    border-top-style: solid;
    border-right-style: solid;
}
p.modbox {
    margin-top:0in;
    margin-right:0in;
    margin-bottom:10.0pt;
    margin-left:0in;
    line-height:normal;
    font-size:11.0pt;
    font-family:"Calibri", "sans-serif";
}
#wrapper #content #main2col .modbox tr .modbox {
    color: #09C;
    font-style: normal;
}
#wrapper #content #main2col .modbox {
    color: #3366cc;
}
#wrapper #content #main2col .modbox {
    color: #3a5774;
}
#wrapper #content #main2col .modbox tr .modbox .MsoNormal .modbox {
    color: #3a5774;
}
#wrapper #content #main2col .modbox {
    color: #3a5774;
}
-->
</style>
<style type="text/css">
<!--
table.MsoTableGrid {
    border:solid;
    font-size:11.0pt;
    font-family:"Calibri", "sans-serif";
}
p.MsoNormal {
    margin-top:0in;
    margin-right:0in;
    margin-bottom:5pt;
    margin-left:0in;
    line-height:normal;
    font-size:10pt;
    font-family:"Calibri", "sans-serif";
}
-->
</style>
<style type="text/css">
<!--
table.modbox {
font-size:10.0pt;
font-family:"Times New Roman","serif";
}
-->
</style>

Only one match is returned. I'm trying to figure out why it doesn't stop at the first closing </style> tag. For the record, I tried adding (\r\n)? after the closing tag, but that made no difference.
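The behaviour can be reproduced on a cut-down sample. The following is a minimal sketch that uses Python's re module as a stand-in for the C# app; the two-block SAMPLE string and the variable names are made up for illustration, and the \r\n line endings are written out explicitly because the pattern depends on them.

```python
import re

# The pattern from the question, unchanged.
GREEDY = r"""<style((\s+\w+(\s*=\s*(?:".*?"|'.*?'|[^'">\s]+))?)+\s*|\s*)>(.*?\r\n)*(</style>)"""

# Hypothetical cut-down input: two separate style blocks, CRLF line endings.
SAMPLE = (
    '<style type="text/css">\r\n'
    'p.modbox { line-height:normal; }\r\n'
    '</style>\r\n'
    '<style type="text/css">\r\n'
    'table.modbox { font-size:10.0pt; }\r\n'
    '</style>\r\n'
)

matches = [m.group(0) for m in re.finditer(GREEDY, SAMPLE)]

print(len(matches))                    # 1
print(matches[0].count("</style>"))    # 2
print(matches[0].endswith("</style>")) # True
```

The single match starts at the first <style and ends at the last </style>, so both closing tags sit inside it: the (.*?\r\n)* group keeps consuming whole lines, including the first </style> line, for as long as a later </style> still lets the overall match succeed.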

Again, today was my first day working with RegEx, so I'm really new to this and could be making a very simple mistake.

Can anyone explain what I've done wrong? Any assistance is greatly appreciated!

Omega192
  • HTML parsing with regex'es is generally bad idea: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – Oleks Apr 23 '11 at 08:03
  • There is a second issue with the regex above : The closing style tag would never match. It must be e.g. (<[/]style>) to match the backslash ! – sebilasse Apr 07 '17 at 09:54
  • Do NOT use regexp for HTML tags! Use a parser instead... – c24b Oct 30 '17 at 15:53

1 Answer


I would say that you have a greediness issue with your regexp. Most probably you should check all your stars (*) and pluses (+) and add a question mark (?) after them, like

 (.*?\r\n)* => (.*?\r\n)*?
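As a rough illustration of that one-character change, here is a minimal sketch comparing the two variants. It uses Python's re module rather than .NET's Regex class (greedy and lazy quantifiers backtrack the same way in both for this pattern); SAMPLE, ATTRS, and count_matches are hypothetical names introduced for the example.

```python
import re

# Attribute part of the question's pattern, shared by both variants.
ATTRS = r"""((\s+\w+(\s*=\s*(?:".*?"|'.*?'|[^'">\s]+))?)+\s*|\s*)"""

GREEDY = "<style" + ATTRS + r">(.*?\r\n)*(</style>)"   # original
LAZY = "<style" + ATTRS + r">(.*?\r\n)*?(</style>)"    # with the extra ?

# Hypothetical cut-down input: two style blocks with CRLF line endings.
SAMPLE = (
    '<style type="text/css">\r\n'
    'p.modbox { line-height:normal; }\r\n'
    '</style>\r\n'
    '<style type="text/css">\r\n'
    'table.modbox { font-size:10.0pt; }\r\n'
    '</style>\r\n'
)

def count_matches(pattern, text):
    """Count non-overlapping matches of pattern in text."""
    return sum(1 for _ in re.finditer(pattern, text))

print(count_matches(GREEDY, SAMPLE))  # 1
print(count_matches(LAZY, SAMPLE))    # 2
```

The inner .*? was already lazy, but it can never cross a line break anyway; it is the outer * on the group, which decides how many lines to take, that needed the ?.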

On a side note, parsing HTML/XML with a regex is a bad idea. Why not use a simple HTML parser and retrieve the content of your tag?
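A minimal sketch of the parser route, assuming Python's standard-library html.parser as a stand-in (a C# app would need an equivalent HTML parsing library). StyleCollector and SAMPLE are hypothetical names for this example, not part of any library.

```python
from html.parser import HTMLParser

class StyleCollector(HTMLParser):
    """Collects the text content of every <style> element."""

    def __init__(self):
        super().__init__()
        self.blocks = []     # contents of finished <style> elements
        self._chunks = None  # a list while inside <style>, else None

    def handle_starttag(self, tag, attrs):
        if tag == "style":
            self._chunks = []

    def handle_data(self, data):
        # Data may arrive in several pieces, so accumulate and join later.
        if self._chunks is not None:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "style" and self._chunks is not None:
            self.blocks.append("".join(self._chunks))
            self._chunks = None

# Hypothetical cut-down input with two style blocks.
SAMPLE = (
    '<style type="text/css">\r\n<!--\r\n'
    'p.modbox { line-height:normal; }\r\n-->\r\n</style>\r\n'
    '<style type="text/css">\r\n<!--\r\n'
    'table.modbox { font-size:10.0pt; }\r\n-->\r\n</style>\r\n'
)

collector = StyleCollector()
collector.feed(SAMPLE)
collector.close()

print(len(collector.blocks))              # 2
print("p.modbox" in collector.blocks[0])  # True
```

Each block comes back separately without any reasoning about greedy or lazy quantifiers, because the parser itself tracks where a style element opens and closes.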

Bruce
  • Greediness, you say? I had seen that come up in several of the articles I read but never quite got it; I'll do some further research and try that. RegEx was the first thing that came to mind, plus I've been wanting to learn some basics of it. I had seen other questions on here that mentioned it was a poor choice with HTML, as HTML isn't a regular language, and that parsers are better. I have no idea how to work with parsers, though, so I'll look into that as well. If your suggestion works I'll be accepting your answer. Thanks for the quick and helpful response! – Omega192 Apr 23 '11 at 09:07
  • Sure enough, that one character change made it work properly. I had a feeling it'd be something really simple haha. Thank you very much! Answer accepted :] – Omega192 Apr 23 '11 at 09:43
  • The default behaviour of a regexp is greedy: match as much as possible per group. To change it, you add a ? after a repetition symbol (* or +), which means match, but keep the group as small as possible to allow further matching. – Bruce Apr 23 '11 at 10:03