I don't know if you can do what you want with a single regular expression, especially since Oracle's implementation of regular expressions does not support lookaround. But there are some things you can do with SQL to get around these limitations. The following will extract the matches for the pattern, first by removing comments from the text, then by matching the patter src=".*\.js"
in what remains. Multiple results are retrieved using CONNECT BY
:
SELECT html_id, REGEXP_SUBSTR(clean_html, 'src=".*\.js"', 1, LEVEL, 'i') AS match
FROM (
SELECT html_id, REGEXP_REPLACE(html_text, '<!--.*?-->', '', 1, 0, 'n') AS clean_html
FROM (
SELECT 1 AS html_id, '<!------<script type="text/javascript" src="js/Shop.js"></script> -->
<!----<script type="text/javascript" src="js/Shop.js"></script> -->
<script type="text/javascript" src="jquery.serialize-object.js"></script><!---->
<script type="text/javascript" src="jquery.serialize-object.js"></script><!-- a comment -- afterwards -->
<script type="text/javascript" src="jquery.serialize-object.js"></script><!-- a comment starting but not ending
-- afterwards -->
<script type="text/javascript" src="jquery.serialize-object.js"></script>
<script type="text/javascript" src="jquery.cookie.js"></script>' AS html_text
FROM dual
)
)
CONNECT BY REGEXP_SUBSTR(clean_html, 'src=".*\.js"', 1, LEVEL, 'i') IS NOT NULL
AND PRIOR html_id = html_id
AND PRIOR DBMS_RANDOM.VALUE IS NOT NULL;
If these results are stored in a table somewhere, then you would do the following:
SELECT html_id, REGEXP_SUBSTR(clean_html, 'src=".*\.js"', 1, LEVEL, 'i') AS match
FROM (
SELECT html_id, REGEXP_REPLACE(html_text, '<!--.*?-->', '', 1, 0, 'n') AS clean_html
FROM mytable
)
CONNECT BY REGEXP_SUBSTR(clean_html, 'src=".*\.js"', 1, LEVEL, 'i') IS NOT NULL
AND PRIOR html_id = html_id
AND PRIOR DBMS_RANDOM.VALUE IS NOT NULL;
It seems strange but the final two lines is necessary to avoid duplicate results.
Results as follows:
| HTML_ID | MATCH |
+---------+------------------------------------+
| 1 | src="jquery.serialize-object.js" |
| 1 | src="jquery.serialize-object.js" |
| 1 | src="jquery.serialize-object.js" |
| 1 | src="jquery.serialize-object.js" |
| 1 | src="jquery.cookie.js" |
+---------+------------------------------------+
SQL Fiddle HERE.
Hope this helps.
EDIT: Edited according to my comment below:
SELECT html_id, REGEXP_SUBSTR(clean_html, 'src="[^"]*\.js"', 1, LEVEL, 'i') AS match
FROM (
SELECT html_id, REGEXP_REPLACE(html_text, '<!--.*?-->', '', 1, 0, 'n') AS clean_html
FROM (
SELECT 1 AS html_id, '<!------<script type="text/javascript" src="js/Shop.js"></script> -->
<!----<script type="text/javascript" src="js/Shop.js"></script> -->
<script type="text/javascript" src="jquery.serialize-object.js"></script><!---->
<script type="text/javascript" src="jquery.serialize-object.js"></script><!-- a comment -- afterwards -->
<script type="text/javascript" src="jquery.serialize-object.js"></script><!-- a comment starting but not ending
-- afterwards -->
<script type="text/javascript" src="jquery.serialize-object.js"></script>
<script type="text/javascript" src="jquery.cookie.js"></script>' AS html_text
FROM dual
)
)
CONNECT BY REGEXP_SUBSTR(clean_html, 'src="[^"]*\.js"', 1, LEVEL, 'i') IS NOT NULL
AND PRIOR html_id = html_id
AND PRIOR DBMS_RANDOM.VALUE IS NOT NULL;
EDITED
If you're searching a CLOB
rather than a CHAR
column, the first line of the CONNECT BY
clause should look like this. REGEXP_SUBSTR()
will return a CLOB
if the relevant column is a CLOB
, and the comparison just takes forever in this case:
CONNECT BY DBMS_LOB.SUBSTR(REGEXP_SUBSTR(clean_html, 'src="[^"]*\.js"', 1, LEVEL, 'i'), 4000, 1) IS NOT NULL
Hope this helps.