0

I know you can remove HTML tags with a command such as this:

REGEXP_REPLACE(overview, '<.+?>')

But, some of the text has actual HTML encoding, where the application actually encoded things, like single quotes to be: &#39 or ’

I'm assuming these are pretty standard. Is there a way to remove them and replace them with the actual character, or am I stuck with REPLACE and listing them?

Many thanks!

Landon Statis
  • 683
  • 2
  • 10
  • 25

2 Answers2

1

Use a proper XML parser:

with t (overview) as (
  SELECT '<div><p>Some entities: &amp; &#39; &lt; &gt; to be handled </p></div>' from dual UNION ALL
  SELECT '<html><head><title>Test</title></head><body><p>&lt;test&gt;</p></body></html>' from dual
)
SELECT x.*
FROM   t
       CROSS JOIN LATERAL (
         SELECT LISTAGG(value) WITHIN GROUP (ORDER BY ROWNUM) AS text
         FROM   XMLTABLE(
                  '//*'
                  PASSING XMLTYPE(t.overview)
                  COLUMNS
                    value CLOB PATH './text()'
                )
      ) x

Which outputs:

TEXT
Some entities: & ' < > to be handled
Test<test>

db<>fiddle here

MT0
  • 143,790
  • 11
  • 59
  • 117
0

You can use utl_i18n.unescape_references():

utl_i18n.unescape_reference(regexp_replace(overview, '<.+?>'))

As a demo:

-- sample data
with t (overview) as (
  select '<div><p>Some entities: &amp; &#39; &lt; &gt; to be handled </p></div>'
  from dual
)
select REGEXP_REPLACE(overview, '<.+?>') as result1,
  utl_i18n.unescape_reference(regexp_replace(overview, '<.+?>')) as result2
from t

gets

RESULT1 RESULT2
Some entities: &amp; &#39; &lt; &gt; to be handled Some entities: & ' < > to be handled

db<>fiddle


I'm not endorsing (or attacking) the notion of using regular expressions; that's handled and refuted and discussed elsewhere. I'm just addressing the part about encoded entities.

Alex Poole
  • 183,384
  • 11
  • 179
  • 318