10

I am looking for a regular expression which matches the pattern src="*.js", but this should not be enclosed in a comment.

consider the following

<!------<script type="text/javascript" src="js/Shop.js"></script>  -->
<!----<script type="text/javascript" src="js/Shop.js"></script>  -->
<script type="text/javascript" src="jquery.serialize-object.js"></script>
<script type="text/javascript" src="jquery.cookie.js"></script>

Extended sample input, described by OP as "correct":

<!------<script type="text/javascript" src="js/Shop.js"></script>  -->
<!----<script type="text/javascript" src="js/Shop.js"></script>  -->
<script type="text/javascript" src="jquery.serialize-object.js"></script><!---->
<script type="text/javascript" src="jquery.serialize-object.js"></script><!-- a comment -- afterwards -->
<script type="text/javascript" src="jquery.serialize-object.js"></script><!-- a comment starting but not ending
-- afterwards -->
<script type="text/javascript" src="jquery.serialize-object.js"></script>
<script type="text/javascript" src="jquery.cookie.js"></script>

The result should not match line 1 and 2 (where the content is enclosed with comment). It should only match line 3 and 4 (3-end, except comment-end line, for extended sample input).

So far I have this regexp which selects all my .js files but also the ones that are commented out: (src=\")+(\S)+(.js)

I am looking for a regex which only selects the script tags with a .js src attribute that are not surrounded by a comment.

I would also like to mention that I am using this regular expression in an Oracle PL SQL query.

Gurmanjot Singh
  • 10,224
  • 2
  • 19
  • 43
  • The accepted answer actually does the opposite of what was asked. This is why I am putting a bounty. We need a complete and correct answer. – mags Mar 27 '18 at 13:53
  • Can we assume that the input is always on a single line? Can we assume that the input is not followed by a comment on the same line? Can we assume that the input is not preceded by a comment on the same line? – Yunnosch Mar 27 '18 at 14:17
  • You can assume it is always on one line and that there will not be a comment before the tag. However, there may be a comment starting after the tag on the same line. – Umair Tarafdar Mar 27 '18 at 14:28
  • Please provide more sample input, demonstrating that. – Yunnosch Mar 27 '18 at 14:29
  • Can we assume that comments on line end do not contain `>`? – Yunnosch Mar 27 '18 at 14:42
  • Because of the bounty, an answer which does not actually solve the problem should not stay accepted. (Sorry, Lance.) Please double-check. This does not mean you should decide which answer to accept before the bounty triggers. – Yunnosch Mar 27 '18 at 15:46
  • Are you going to do a replacement? – revo Mar 27 '18 at 15:57
  • 1
    While it is easy to construct a query with regular expressions in Oracle SQL that produces the necessary output for this exact input, this approach will inevitably fail with more varied data. Consider the case when an attribute has `>` or `<` inside. Also see [this question](https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454). If it is absolutely necessary to parse html in pl/sql you can look at [this example](https://stackoverflow.com/questions/35457126/handling-html-data-in-oracle-query), that makes use of dbms_xmldom package. – wolfrevokcats Mar 28 '18 at 00:15
  • Is there any way to test regular expressions without installing an Oracle database ? Like some online tester that would check the regexp with PL/SQL syntax ? – Caillou Mar 28 '18 at 17:24
  • Giving the link to the online test was very good. Please edit your question to contain it and I also recommend to add the sample input given in the online tester. – Yunnosch Mar 29 '18 at 14:13
  • I perceive your statement "You can assume it is always on one line and that there will not be a comment before the tag. However, there may be a comment starting after the tag on the same line." to be in contrast to the sample input you provide in the online tester. Especially the fact that part of the input spans more than one line without being the case of the comment starting after the part to be matched. For me that is exactly the point which crosses the border to where regexes stop being the right tool and you should switch to an XML parser. See the link by @wolfrevokcats. – Yunnosch Mar 29 '18 at 14:17
  • Are the values over which you're searching in one large text field (like a CLOB), or stored in multiple rows? – David Faber Mar 29 '18 at 19:56
  • So many unanswered comments here. You are not gonna find an answer if you ignore others comments. – revo Mar 30 '18 at 23:03
  • Does this has to be regexp? Why you cannot use Like and then set operators like minus for example? – Art Apr 02 '18 at 19:18

6 Answers6

3

For e.g. this sample input:

<!------<script type="text/javascript" src="js/Shop.js"></script>  -->
<!----<script type="text/javascript" src="js/Shop.js"></script>  -->
<script type="text/javascript" src="jquery.serialize-object.js"></script><!---->
<script type="text/javascript" src="jquery.serialize-object.js"></script><!-- a comment -- afterwards -->
<script type="text/javascript" src="jquery.serialize-object.js"></script><!-- a comment starting but not ending
-- afterwards -->
<script type="text/javascript" src="jquery.serialize-object.js"></script>
<script type="text/javascript" src="jquery.cookie.js"></script>

This regex: src="[^"]*\.js\"></script>(\s*<!--[^>]*-->)*(\s*<!--[^>]*)?$
will give you this output:

<script type="text/javascript" src="jquery.serialize-object.js"></script><!---->
<script type="text/javascript" src="jquery.serialize-object.js"></script><!-- a comment -- afterwards -->
<script type="text/javascript" src="jquery.serialize-object.js"></script><!-- a comment starting but not ending
<script type="text/javascript" src="jquery.serialize-object.js"></script>
<script type="text/javascript" src="jquery.cookie.js"></script>

I tested with GNU grep 2.5.4, hoping that it gets close enough to your regex flavor. The regex is very light on special features.

Explanation:

  • \"[^"]* is "anything within " "
  • (<!--[^>]*-->)* is "any number of complete comments, if they do not contain > "
  • (<!--[^>]*)?$ is "an optional start of a non-> comment at the end of a line"
  • \s* allowing optional white space

Note, at some point of possible complexity of relevant input, regexes stop being the right tool. Beyond, a dedicated tool, i.e. a parser for XML/html whatever is the choice.
For me that point is reached when the possibility occurs of the relevant input being "hidden" inside a multiline comment. I feel that you turned the question into a moving target, by first confirming that expecting relevant input on one line is allowed (apart from a comment starting afterwards) but then changed the rules, by adding contradicting sample input. At one point you did describe the sample input I proposed as "correct".
The (very funny) XML/regex discussing QA linked in the comments demonstrates the hell you can end up in, if you do not draw the line early enough.
When restricted into a given environment, e.g. SQL server, the special abilities of that environment should be leveraged. Surely processing the non-commented parts of the input by SQL mechanisms to achieve a some steps further goal is possible. I.e. drop your immediate idea of how to proceed and take a little detour in thinking. Try to make sure that you do not exhaust yourself on a XY-problem.

Yunnosch
  • 26,130
  • 9
  • 42
  • 54
  • Thanks for the detailed answer, The sample input that you are considering is correct. However this regex does not work in PL/SQL environment. – Umair Tarafdar Mar 27 '18 at 15:48
  • I will be available in some 30 minutes. And will try to setup a chat. I will try to adapt to your regex flavor. Can you provide info? E.g. anything the environment complains about in the regex? E.g. "unmatched ()" or "cannot find string terminator" or some such... – Yunnosch Mar 27 '18 at 15:51
  • Hmm, I think with your reputation, you will not be able to talk in chat. Can anybody help how I can efficiently discuss with OP? – Yunnosch Mar 27 '18 at 16:14
  • Please start experimenting: `"` -> `\"` (treat the one inside `[]`separatly), or `?` -> `*`. Check whether parts of the regex match, e.g. only `src="[^"]*\.js\"` or `src=\"[^"]*\.js\"` or `src=\"[^\"]*\.js\"`. – Yunnosch Mar 27 '18 at 16:18
  • 2
    Sigh. I probably have only paved the path for somebody else to reap the bounty... I would like it to be @LanceToth, he earned it. – Yunnosch Mar 27 '18 at 16:47
  • Sorry, it did work on PL SQL but was giving zero results. I created a regex link where we can try your query on: https://regexr.com/3n0bp Currently it contains Lance's query which is the closest we got. – Umair Tarafdar Mar 29 '18 at 07:59
  • I notice that you did not give feedback on my proposals to adapt the regex to your flavor. Instead you change the rules by adding conflicting sample input, while not updating the one in the question. I consider this a moving target question now. Any downvoters, please explain the reason based on the initial question and first few comments. – Yunnosch Mar 29 '18 at 14:32
  • 1
    Thanks for your help, your solution was very close but unfortunately kept struggling with comments which included extra text like I apologize if the initial post was not clear. In the end the solution by David Faber helped us solve the problem by first removing all comments and then running a simple regex instead. – Umair Tarafdar Apr 03 '18 at 08:35
3

I don't know if you can do what you want with a single regular expression, especially since Oracle's implementation of regular expressions does not support lookaround. But there are some things you can do with SQL to get around these limitations. The following will extract the matches for the pattern, first by removing comments from the text, then by matching the patter src=".*\.js" in what remains. Multiple results are retrieved using CONNECT BY:

SELECT html_id, REGEXP_SUBSTR(clean_html, 'src=".*\.js"', 1, LEVEL, 'i') AS match
  FROM (
    SELECT html_id, REGEXP_REPLACE(html_text, '<!--.*?-->', '', 1, 0, 'n') AS clean_html
      FROM (
        SELECT 1 AS html_id, '<!------<script type="text/javascript" src="js/Shop.js"></script>  -->
        <!----<script type="text/javascript" src="js/Shop.js"></script>  -->
        <script type="text/javascript" src="jquery.serialize-object.js"></script><!---->
        <script type="text/javascript" src="jquery.serialize-object.js"></script><!-- a comment -- afterwards -->
        <script type="text/javascript" src="jquery.serialize-object.js"></script><!-- a comment starting but not ending
        -- afterwards -->
        <script type="text/javascript" src="jquery.serialize-object.js"></script>
        <script type="text/javascript" src="jquery.cookie.js"></script>' AS html_text
          FROM dual
    )
)
CONNECT BY REGEXP_SUBSTR(clean_html, 'src=".*\.js"', 1, LEVEL, 'i') IS NOT NULL
   AND PRIOR html_id = html_id
   AND PRIOR DBMS_RANDOM.VALUE IS NOT NULL;

If these results are stored in a table somewhere, then you would do the following:

SELECT html_id, REGEXP_SUBSTR(clean_html, 'src=".*\.js"', 1, LEVEL, 'i') AS match
  FROM (
    SELECT html_id, REGEXP_REPLACE(html_text, '<!--.*?-->', '', 1, 0, 'n') AS clean_html
      FROM mytable
)
CONNECT BY REGEXP_SUBSTR(clean_html, 'src=".*\.js"', 1, LEVEL, 'i') IS NOT NULL
   AND PRIOR html_id = html_id
   AND PRIOR DBMS_RANDOM.VALUE IS NOT NULL;

It seems strange but the final two lines is necessary to avoid duplicate results.

Results as follows:

| HTML_ID | MATCH                              |
+---------+------------------------------------+
|       1 | src="jquery.serialize-object.js"   |
|       1 | src="jquery.serialize-object.js"   |
|       1 | src="jquery.serialize-object.js"   |
|       1 | src="jquery.serialize-object.js"   |
|       1 | src="jquery.cookie.js"             |
+---------+------------------------------------+

SQL Fiddle HERE.

Hope this helps.

EDIT: Edited according to my comment below:

SELECT html_id, REGEXP_SUBSTR(clean_html, 'src="[^"]*\.js"', 1, LEVEL, 'i') AS match
  FROM (
    SELECT html_id, REGEXP_REPLACE(html_text, '<!--.*?-->', '', 1, 0, 'n') AS clean_html
      FROM (
        SELECT 1 AS html_id, '<!------<script type="text/javascript" src="js/Shop.js"></script>  -->
        <!----<script type="text/javascript" src="js/Shop.js"></script>  -->
        <script type="text/javascript" src="jquery.serialize-object.js"></script><!---->
        <script type="text/javascript" src="jquery.serialize-object.js"></script><!-- a comment -- afterwards -->
        <script type="text/javascript" src="jquery.serialize-object.js"></script><!-- a comment starting but not ending
        -- afterwards -->
        <script type="text/javascript" src="jquery.serialize-object.js"></script>
        <script type="text/javascript" src="jquery.cookie.js"></script>' AS html_text
          FROM dual
    )
)
CONNECT BY REGEXP_SUBSTR(clean_html, 'src="[^"]*\.js"', 1, LEVEL, 'i') IS NOT NULL
   AND PRIOR html_id = html_id
   AND PRIOR DBMS_RANDOM.VALUE IS NOT NULL;

EDITED

If you're searching a CLOB rather than a CHAR column, the first line of the CONNECT BY clause should look like this. REGEXP_SUBSTR() will return a CLOB if the relevant column is a CLOB, and the comparison just takes forever in this case:

CONNECT BY DBMS_LOB.SUBSTR(REGEXP_SUBSTR(clean_html, 'src="[^"]*\.js"', 1, LEVEL, 'i'), 4000, 1) IS NOT NULL

Hope this helps.

David Faber
  • 12,277
  • 2
  • 29
  • 40
  • I realized that the pattern might need some tweaking, e.g., `src="[^"]*\.js"`- I think mine only works because I did not specify the `n` option in `REGEXP_SUBSTR()` and depending on the existence of newlines probably isn't good practice. – David Faber Mar 30 '18 at 10:53
  • 1
    Thanks, removing the comments first before running a regex solved the problem! I will accept your answer as the correct one for the thread. – Umair Tarafdar Apr 03 '18 at 08:36
2

I've put a negative look-ahead before the end of your regex, but mind that if there's a commented part after the src it will likewise be ignored.

(src=\")+(\S)+(\.js\")+(?!.*-->)(.*)

Edit:

I managed something similiar without the lookahead (which PL/SQL doesn't have):

(src=\")(\S)+(\.js\")[^(--)\n]+(\n|$)
Lance Toth
  • 430
  • 3
  • 17
  • This regex works perfectly in Javascript. However I am trying to run this regex in a SQL query in PL/SQl developer. It does not seem to work here. Does anyone know how to work with negative look ahead in SQL queries? – Umair Tarafdar Mar 22 '18 at 10:03
  • @umairtarafdar I edited my answer, hopefully this will help you – Lance Toth Mar 22 '18 at 10:21
  • The regex you came up with for PL/SQL is actually doing the opposite of what I am looking for, it's matching the results enclosed in the comments. Is there some way i can negate this? – Umair Tarafdar Mar 22 '18 at 10:49
  • @UmairTarafdar I can't imagine how that is possible, sorry – Lance Toth Mar 22 '18 at 10:59
  • This probably doesn't work because `--` converts the rest of the line into a [comment](https://docs.oracle.com/cd/B28359_01/appdev.111/b28370/comment.htm#LNPLS01308) in PL SQL – TheChetan Mar 28 '18 at 17:48
  • TheChetan is right, we had to backspace the -- in Lance's solution to make it work properly. Now the only issue is that it doesn't detect comments which span multiple lines but this already a huge progress. @Lance Toth - any ideas on this? – Umair Tarafdar Apr 03 '18 at 07:59
  • I will look into it some more. This is exactly what lookahead and lookbehind is for, but those are not available for you – Lance Toth Apr 03 '18 at 08:21
0

Here is my solution : one simple negative lookbehind.

(?<!<!--.+)src=".+\.js"

This matches all the src attributes in your extended example, but not those preceded by <!--. It might just be enough, tell me if I missed some specific cases ;)

Here is my solution running on your extended example : https://regex101.com/r/rmHkbm/1

EDIT : This is working in javascript, I don't know for ORACLE PL/SQL. Is there any way to test it without installing an Oracle database ?

Caillou
  • 1,451
  • 10
  • 19
  • 1
    Did you notice OPs comments on the other negative lookaround based answer, that it is not clear how to make them work in the target environment? Could you explain how your lookaround is different and easier applicable than the rejected/unhelpful one? Hmmm, I notice this sounds like an attempt to shoot down your answer. Please take it instead as a proposal how to improve it beyond the others (including mine); with the lack of downvotes as an indicator of truthfulness. – Yunnosch Mar 29 '18 at 14:39
  • "Is there any way to test it without installing an Oracle database ?" << SQL Fiddle – David Faber Mar 29 '18 at 20:42
  • @Yunnosch No problem. I actually didn't see that the other lookaround based answers were already dismissed by OP. Here, have an upvote for your kindness. – Caillou Apr 03 '18 at 12:20
0

I don't think it's possible to do what you want using a single regular expression without negative lookaround. But, you can do it by logically combining two similar regular expressions in a way that's easy to do in SQL. The basic idea is:

[MATCH_EXPR] AND NOT [COMMENTED_MATCH_EXPR]

Assume we have a table data with a column line (lines of code), we could select the lines of interest with something like:

SELECT line
FROM data
WHERE REGEXP_LIKE(line, 'src="[^"]+.js"') AND NOT REGEX_LIKE(line, '<!--.*src="[^"]+.js"');

You can update the regular expressions to be more precise and/or do something more sophisticated with them, e.g. capture the file names, but the approach would be the same.

This approach is not bulletproof in that it would fail to find lines that consist of two <script> statements where only the second one is commented out, since the second regular expression would match. Nevertheless, it would likely work for the vast majority of typical code, including the examples given above.

cryptoplex
  • 1,235
  • 10
  • 14
0

I have tried the below on https://livesql.oracle.com, so probably will work for you. assuming an uncommented line starts with '<script'. It matches the lines like

    <script type="text/javascript" src="jquery.cookie.js"></script>
    <script type="text/javascript" src="jquery.serialize-object.js"/>
    <script type="text/javascript" src="jquery.serialize-object.js"></script><!-- a comment -- afterwards -->

query with regular expressions:

select "SRC" from "TABLE_1" 
where REGEXP_LIKE (SRC, '^\<script.+\.js.+script\>$') 
   or REGEXP_LIKE (SRC, '^\<script.+\.js.+script\>\<\!\-\-.+\-\-\>$') 
   or REGEXP_LIKE (SRC, '^\<script.+\.js.+\/\>$');