I am trying to run the following query to go through a bunch of CLOBs containing HTML pages:
SELECT template_id, REGEXP_SUBSTR(clean_html, 'src="[^"]*\.js"', 1, LEVEL, 'i') AS match
FROM (
SELECT template_id, REGEXP_REPLACE(html_text, '<!--.*?-->', '', 1, 0, 'n') AS clean_html
FROM (
SELECT t.id as template_id, dbms_lob.substr(t.data, 4000, 1) AS html_text
FROM template t
)
)
CONNECT BY REGEXP_SUBSTR(clean_html, 'src="[^"]*\.js"', 1, LEVEL, 'i') IS NOT NULL
GROUP BY template_id,
clean_html,
LEVEL
order by 1
But I keep getting "character string buffer too small on line 1" almost directly after running my query.
If I run it on a small dataset with only a few templates it works, but once I run it on my entire HTML template list then it throws the error.
I think that the issue is related to dbms_lob.substr(data, 4000, 1) and probably has something to do with the substring exceeding 4000 bytes because of hidden characters (or something?), but I don't know how to fix it.
If I write dbms_lob.substr(data, 2000, 1) then my substring is too small and I lose important data from my HTML file. If I make it larger than 4000 then I instantly get the "buffer too small" error.
Anyone know how I can fix this? Ideally, I'd like to go through my entire 'data' field and not just the first 4000 characters. However, the first 4000 characters would be fine if it works on my long list of HTML files.
Thank you