
I'm looking to remove certain style properties from table tags only. So I might have a messy string like this:

<table border="0" width="641" style="width:480.95pt;margin-left:4.65pt;border-collapse:collapse;mso-padding-alt:0in 5.4pt 0in 5.4pt">
 <tbody><tr style="mso-yfti-irow:0;mso-yfti-firstrow:yes;height:23.25pt">
  <td width="299" nowrap="" colspan="3" valign="top" style="width:224.45pt;border-top:
  solid windowtext 1.0pt;border-left:solid windowtext 1.0pt;border-bottom:none;
  border-right:solid black 1.0pt;mso-border-top-alt:solid windowtext .5pt;
  mso-border-left-alt:solid windowtext .5pt;mso-border-right-alt:solid black .5pt;
  background:#538ED5;padding:0in 5.4pt 0in 5.4pt;height:23.25pt">

I need a regex that matches width:XXXpt; inside the table tag. I tried this and many variations, but none of them work:

^<table.*(width:\wpt;?)\/table>$

An explanation would be greatly appreciated as I'm learning regex.

Whip

1 Answer


Regex is a world of its own, and there are usually several ways to reach the same result. Here is one of them:

<table[^>]+style=[^>]*?[" ;]width:([0-9.]+pt)

I'll break it down to explain some of it, but to fully understand you'll need to learn the regex syntax.

[^>]+ - matches one or more characters that are not >, so the match can never run past the end of the opening tag. This keeps the search from spilling into later elements if this tag does not match.

style=[^>]*? - makes sure there's a style attribute in the tag, then lazily skips as few characters as possible, still without leaving the tag.

[" ;]width: - requires a quote, space, or semicolon right before width:, so only the width property itself matches and not the tail of another property (like border-width).

([0-9.]+pt) - captures one or more digits and/or decimal points followed by pt, i.e. the value itself, such as 480.95pt.

You don't need to include the closing table tag in the pattern, because [^>] already keeps the whole match inside the opening tag.
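To see the pieces working together, here's a minimal sketch in JavaScript (my assumption for the language; the variable names and the shortened sample tags are made up for illustration). It checks that the pattern captures the width of a table tag, and that neither a td tag nor a border-width property matches:

```javascript
// The pattern from above, unchanged.
const tableWidth = /<table[^>]+style=[^>]*?[" ;]width:([0-9.]+pt)/;

// Shortened version of the opening tag from the question (illustrative only).
const html =
  '<table border="0" width="641" style="width:480.95pt;margin-left:4.65pt;border-collapse:collapse">';

// Capture group 1 holds the value of the width property.
const match = html.match(tableWidth);
const width = match ? match[1] : null;

// A td tag does not match: the pattern has to start at "<table".
const tdMatch = '<td style="width:224.45pt">'.match(tableWidth);

// border-width does not match: the character before "width:" is "-",
// which is not one of the characters allowed by [" ;].
const borderMatch = '<table style="border-width:1.0pt">'.match(tableWidth);

console.log(width, tdMatch, borderMatch);
```

Note that the width="641" attribute is ignored too, since the pattern looks for width: (with a colon) after style=.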

And finally, some words of wisdom every one of us learned over the years.

When it comes to working with a complex HTML structure, whatever the programming language, it's almost always better to use a dedicated parser than to treat the markup as a string.

Gil
  • Thanks a lot for your time. I couldn't find a good HTML-parsing solution for cleaning up the HTML, so I'm doing it this way. The messy HTML comes from users pasting from MS Word into a JS WYSIWYG editor. With a few string replaces, the HTML size can be reduced from something like 100k chars to 30k chars. – Whip May 23 '21 at 02:20
  • BTW, here's what I ended up using: `<table[^>]+style=[^>]*?[" ;](width:\s*[0-9.]+pt;?)`
    – Whip May 23 '21 at 02:22
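
For anyone landing here later, this is roughly how a pattern like that could be used to strip the declaration with String.prototype.replace. It's only a sketch under assumptions: the thread doesn't show the actual replace call, so I've moved the capture group onto the prefix (so it can be kept via $1), and the helper name stripTableWidth and the sample input are hypothetical.

```javascript
// Assumed rearrangement of the pattern: the part before the declaration is
// captured, and "width:<number>pt" plus an optional ";" is what gets dropped.
const tableWidthDecl = /(<table[^>]+style=[^>]*?[" ;])width:\s*[0-9.]+pt;?/g;

// Hypothetical helper: removes the first matching width declaration from
// the style attribute of each opening table tag.
function stripTableWidth(html) {
  return html.replace(tableWidthDecl, "$1");
}

// Made-up input, loosely modeled on the markup in the question.
const input =
  '<table border="0" style="width:480.95pt;margin-left:4.65pt">' +
  '<tbody><tr><td style="width:224.45pt">x</td></tr></tbody></table>';

const cleaned = stripTableWidth(input);
console.log(cleaned);
```

The td keeps its width because the pattern only starts at a table tag. As the answer says, a real parser is still the safer route when the markup gets more complicated.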