I know this has been asked many times but I continue to fail when the answers are applied to my use case.

Content is being exported as HTML in a JSON value from a content management system via a REST API in preparation to be posted to Slack. Unfortunately, Slack does not support tables in messages. I want to delete all tables from the exported HTML using JavaScript. This is an example of how the HTML looks after being processed:

<div style="clear:both;">
<p><strong>Managed Value Set Changes</strong></p>
<table class="top">
<tbody>
<tr>
<td>
<p><strong>Value Set</strong></p>
</td>
<td width="411">
<p><strong>Change Description</strong></p>
</td>
</tr>
<tr>
<td>
<p>Value sets</p>
</td>
<td width="411">
<p>Updated value sets and added 5 new value sets for Model Year 3 </p>
</td>
</tr>
</tbody>
</table>
<p>Release Notes</p>
<p>Content updates for this month include updates to each of the following:</p>
<p></p>
<p>Updates to our COVID-19 value sets have been released. A detailed summary of the changes is available at the following link: <a href="">COVID-19 Coding</a></p>
<p><strong>Monthly updates to standard terminologies.</strong></p>
<table>
<tbody>
<tr>
<td width="185">
<p><strong>Content Area</strong> </p>
</td>
<td width="408">
<p><strong>Change Description</strong> </p>
</td>
</tr>
<tr>
<td width="185">
<p>ABC</p>
</td>
<td width="408">
<p>Updates to standard terminology</p>
</td>
</tr>
</tbody>
</table>
<p>Please contact us for more details.</p>
</div>

The desired output is the same HTML only without tables:

<div style="clear:both;">
<p><strong>Managed Value Set Changes</strong></p>
<p>Release Notes</p>
<p>Content updates for this month include updates to each of the following:</p>
<p></p>
<p>Updates to our COVID-19 value sets have been released. A detailed summary of the changes is available at the following link: <a href="">COVID-19 Coding</a></p>
<p><strong>Monthly updates to standard terminologies.</strong></p>
<p>Please contact us for more details.</p>
</div>

I recognize that the HTML may not be best practice. I don't have control over how the CMS exports its content.

I have tried:

str.replace(/<table.*?>[\s\S]+</table>/g, "")

str.replace(/<table.*?>[\s\S]*?</table>/g, "")

str.replace(/[\s\S]*<table>(.*?)</table>[\s\S]*/g, "")

I am clearly missing something obvious and would be grateful for any insight. Thank you for any help.

hcdocs
  • [Regex is not an HTML parser](https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags). If you need to change the DOM tree that your HTML source represents, [_parse the HTML source code into a DOM tree_](https://developer.mozilla.org/en-US/docs/Web/API/DOMParser) and then remove/replace the table nodes however you like, then get the new HTML source code with `.outerHTML` on the top element. No regex, just normal JS operating on normal DOM content. And a lot more future-proof. – Mike 'Pomax' Kamermans Jun 26 '23 at 17:39
  • You need to escape the `/` in `</table>`. – Barmar Jun 26 '23 at 17:40
  • When I fix that, the second version works. – Barmar Jun 26 '23 at 17:41
  • You can also use `.*?` along with the `s` flag to allow `.` to match newline. – Barmar Jun 26 '23 at 17:42
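
The DOM-based approach from the first comment could look roughly like the sketch below. It assumes a browser environment, since `DOMParser` is a Web API and is not defined in plain Node.js (there, a DOM implementation such as jsdom would be needed instead). The function name `removeTablesWithDom` is made up for illustration, and it returns `body.innerHTML` rather than `.outerHTML` of a single top element so that fragments with several top-level elements also work.

```javascript
// Browser-only sketch: relies on the DOMParser Web API.
// removeTablesWithDom is a hypothetical name, not something from the thread.
function removeTablesWithDom(html) {
  // Parse the fragment; the parser wraps it in <html><head></head><body>...</body></html>.
  const doc = new DOMParser().parseFromString(html, 'text/html');

  // Remove every table node. Nested tables disappear together with their parent.
  doc.querySelectorAll('table').forEach((table) => table.remove());

  // Serialize only what was inside <body>, i.e. the original fragment minus tables.
  return doc.body.innerHTML;
}

// Usage (in a browser console):
//   const cleaned = removeTablesWithDom(str);
```

Because the parser normalizes the markup, the returned string may differ slightly from the input in whitespace or attribute quoting.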

1 Answer

You should add the `s` flag to your regex so that `.` also matches newlines. And don't forget to escape the `/`:

const str = `<div style="clear:both;">
<p><strong>Managed Value Set Changes</strong></p>
<table class="top">
<tbody>
<tr>
<td>
<p><strong>Value Set</strong></p>
</td>
<td width="411">
<p><strong>Change Description</strong></p>
</td>
</tr>
<tr>
<td>
<p>Value sets</p>
</td>
<td width="411">
<p>Updated value sets and added 5 new value sets for Model Year 3 </p>
</td>
</tr>
</tbody>
</table>
<p>Release Notes</p>
<p>Content updates for this month include updates to each of the following:</p>
<p></p>
<p>Updates to our COVID-19 value sets have been released. A detailed summary of the changes is available at the following link: <a href="">COVID-19 Coding</a></p>
<p><strong>Monthly updates to standard terminologies.</strong></p>
<table>
<tbody>
<tr>
<td width="185">
<p><strong>Content Area</strong> </p>
</td>
<td width="408">
<p><strong>Change Description</strong> </p>
</td>
</tr>
<tr>
<td width="185">
<p>ABC</p>
</td>
<td width="408">
<p>Updates to standard terminology</p>
</td>
</tr>
</tbody>
</table>
<p>Please contact us for more details.</p>
</div>`;

console.log(str.replace(/<table.*?>.+?<\/table>\s*/gs, ''));

Alexander Nenashev
  • @hcdocs ... The above suggested approach already fails for a 2nd, nested table. One does not want a solution which (mis/ab)uses regex for html parsing. – Peter Seliger Jun 26 '23 at 17:58
  • @PeterSeliger I agree about the abuse, but more often an OP just wants an answer to their particular question. Nested tables can also be handled with recursion. – Alexander Nenashev Jun 26 '23 at 20:41
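
For the nested-table case raised in these comments, one option that runs without a DOM is to scan for opening and closing table tags and track nesting depth, rather than relying on a single lazy match. The following is only a minimal sketch: `stripTables` is a hypothetical helper, it still uses a small regex to find the tags, and it assumes that attribute values never contain `>` and that no `<table>` text appears inside comments or scripts.

```javascript
// Hypothetical helper (not from the thread): removes every <table>...</table>
// block, including tables nested inside other tables, by tracking tag depth.
function stripTables(html) {
  // Matches an opening <table ...> or a closing </table>, case-insensitively.
  const tagPattern = /<(\/?)table\b[^>]*>/gi;
  let result = '';
  let depth = 0;    // number of tables currently open
  let keepFrom = 0; // index where the next chunk of kept text starts
  let match;

  while ((match = tagPattern.exec(html)) !== null) {
    const isClosing = match[1] === '/';
    if (!isClosing) {
      // Entering an outermost table: keep everything seen before it.
      if (depth === 0) result += html.slice(keepFrom, match.index);
      depth += 1;
    } else if (depth > 0) {
      depth -= 1;
      // Leaving the outermost table: resume keeping text after its closing tag.
      if (depth === 0) keepFrom = tagPattern.lastIndex;
    }
  }

  // Keep the tail only if every table was closed; an unclosed table is dropped to the end.
  if (depth === 0) result += html.slice(keepFrom);
  return result;
}

const nested =
  '<div><p>a</p><table><tr><td><table><tr><td>x</td></tr></table></td></tr></table><p>b</p></div>';

console.log(stripTables(nested));
// → <div><p>a</p><p>b</p></div>

// For comparison, the lazy regex stops at the first </table> and leaves debris behind:
console.log(nested.replace(/<table.*?>.+?<\/table>\s*/gs, ''));
// → <div><p>a</p></td></tr></table><p>b</p></div>
```

Unlike the regex in the answer, this sketch does not trim the whitespace that follows each removed table, so the result may contain extra blank lines.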