0

I am trying to run the following query to go through a bunch of CLOBs containing HTML pages:

SELECT template_id, REGEXP_SUBSTR(clean_html, 'src="[^"]*\.js"', 1, LEVEL, 'i') AS match
  FROM (
    SELECT template_id,  REGEXP_REPLACE(html_text, '<!--.*?-->', '', 1, 0, 'n') AS clean_html
     FROM (
           SELECT t.id as template_id, dbms_lob.substr(t.data, 4000, 1) AS html_text
           FROM template t
     )
  )
CONNECT BY REGEXP_SUBSTR(clean_html, 'src="[^"]*\.js"', 1, LEVEL, 'i') IS NOT NULL
 GROUP BY template_id,
      clean_html,
      LEVEL
order by 1

But I keep getting "character string buffer too small on line 1" almost directly after running my query.

If I run it on a small dataset with only a few templates it works, but once I run it on my entire HTML template list then it throws the error.

I think that the issue is related to dbms_lob.substr(data, 4000, 1) and probably has something to do with the substring exceeding 4000 bytes because of hidden characters (or something?), but I don't know how to fix it.

If I write dbms_lob.substr(data, 2000, 1) then my substring is too small and I lose important data from my HTML file. If I make it larger than 4000 then I instantly get the "buffer too small" error.

Anyone know how I can fix this? Ideally, I'd like to go through my entire 'data' field and not just the first 4000 characters. However, the first 4000 characters would be fine if it works on my long list of HTML files.

Thank you

MT0
  • 143,790
  • 11
  • 59
  • 117
Piet Smet
  • 69
  • 7
  • What happens if you don't use `dbms_lob.substr` on `t.data` and use it outside? – Kaushik Nayak Apr 03 '18 at 14:13
  • If your HTML is actually well formatted XML then you can parse it as XML rather than trying to use regular expressions. – MT0 Apr 03 '18 at 14:58
  • Why the `GROUP BY`? – David Faber Apr 03 '18 at 15:08
  • This code looks familiar ... https://stackoverflow.com/questions/49274050/regular-expression-to-select-a-particular-content-provided-it-is-not-enclosed-i/49564489#49564489 – David Faber Apr 03 '18 at 20:10
  • Also, this question may be related: https://stackoverflow.com/questions/13819551/dbms-lob-substr-throwing-character-string-buffer-too-small-error?rq=1 – David Faber Apr 03 '18 at 22:02
  • Hi David, that's correct! I'm working with the user from the other post. The problem is a new one though, as the original question was about regex solutions and not buffer size problems. Thanks for your initial help though! – Piet Smet Apr 04 '18 at 15:46

1 Answers1

0

I'm assuming you're using GROUP BY to get distinct values, right? The problem with using CONNECT BY here over lots of records is that it will return a lot of duplicates. But GROUP BY and DISTINCT are not the proper way to handle this. Take out the GROUP BY and append the following lines to your query:

AND PRIOR template_id = template_id
AND PRIOR DBMS_RANDOM.VALUE IS NOT NULL

Also, I think you want to add the following condition to your innermost query:

WHERE REGEXP_LIKE(t.data, 'src="[^"]*\.js"', 'i')

Otherwise, the query will always return at least one row for each value of template_id.

I'm not quite sure what is happening with the CLOBs but I haven't been able to reproduce your error exactly.

EDITED

I think I have a solution. Please try the following:

SELECT template_id, REGEXP_SUBSTR(clean_html, 'src="[^"]*\.js"', 1, LEVEL, 'i') AS match
  FROM (
    SELECT template_id,  REGEXP_REPLACE(html_text, '<!--.*?-->', '', 1, 0, 'n') AS clean_html
     FROM (
       SELECT t.id as template_id, t.data AS html_text
         FROM template t
        WHERE REGEXP_LIKE(t.data, 'src="[^"]*\.js"', 'i')
     )
  )
CONNECT BY TO_CHAR(REGEXP_SUBSTR(clean_html, 'src="[^"]*\.js"', 1, LEVEL, 'i')) IS NOT NULL
  AND PRIOR template_id = template_id
  AND PRIOR DBMS_RANDOM.VALUE IS NOT NULL;

I believe the problem lay in the comparison in the CONNECT BY clause

CONNECT BY REGEXP_SUBSTR(clean_html, 'src="[^"]*\.js"', 1, LEVEL, 'i') IS NOT NULL

It's better to compare a CHAR here than a CLOB:

CONNECT BY TO_CHAR(REGEXP_SUBSTR(clean_html, 'src="[^"]*\.js"', 1, LEVEL, 'i')) IS NOT NULL

You could even use DBMS_LOB.SUBSTR() here to be safe:

CONNECT BY DBMS_LOB.SUBSTR(REGEXP_SUBSTR(clean_html, 'src="[^"]*\.js"', 1, LEVEL, 'i'), 4000, 1) IS NOT NULL

Finally, I don't think you need that many levels of subqueries. I think you could do the following:

SELECT template_id, REGEXP_SUBSTR(clean_html, 'src="[^"]*\.js"', 1, LEVEL, 'i') AS match
  FROM (
    SELECT t.id AS template_id, REGEXP_REPLACE(t.data, '<!--.*?-->', '', 1, 0, 'n') AS clean_html
      FROM template t
     WHERE REGEXP_LIKE(t.data, 'src="[^"]*\.js"', 'i')
  )
CONNECT BY DBMS_LOB.SUBSTR(REGEXP_SUBSTR(clean_html, 'src="[^"]*\.js"', 1, LEVEL, 'i'), 4000, 1) IS NOT NULL
  AND PRIOR template_id = template_id
  AND PRIOR DBMS_RANDOM.VALUE IS NOT NULL;

Hope this helps.

David Faber
  • 12,277
  • 2
  • 29
  • 40
  • 1
    Hi David, this solved the issue! It also made my query run a lot faster, thank you! You were right about all your assumptions regarding the connect by as well. It was mainly due to my limited knowledge of SQL. Thanks for teaching me something new about PRIOR. – Piet Smet Apr 04 '18 at 15:57