JSP Text Processing with Regex

Question

I have a large number (>1500) of JSP files that I am trying to convert to JSPX. I am using a tool that parses well-formed JSPs and converts them to JSPX; however, my JSPs are not all well-formed :)

My solution is to pre-process the JSPs and convert untidy code so the tool will parse them correctly. The main problem I am trying to resolve is that of unquoted attribute values. Examples:

<INPUT id="foo" size=1>
<input id=body size="2">

My current regex for finding these is (in Java string format):

"(\\w+)=([^\"' >]+)"

And my replacement string is (in Java string format):

"$1=\"$2\""

This works well, EXCEPT for a few patterns, both of which involve inline scriptlets. For example:

<INPUT id=foo value="<%= someBean.method("a=b") %>">

In this case, my pattern matches the string literal "a=b", which I don't want to do. What I'd like to have happen is that the regex would IGNORE anything between <% and %>. Is there a regular expression that will do what I am trying to do?

EDIT: Changed the title to clarify that I am NOT trying to parse HTML / JSP with regexes... I am doing a simple syntactic transformation to prepare the input for parsing.

It looks like you are trying to match an XML-like language with regular expressions. You might want to read http://stackoverflow.com/a/1732454/159388 before continuing along this path. — murgatroid99, May 23 '12 at 19:59
No, I'm not trying to parse XML with regular expressions. As I mention in the question above, I am using another tool that parses JSP. I am trying to do a lexical pre-processing of the text, before the parser does its work. — Steve H., May 23 '12 at 21:10

dragon66 · Answer 1 · 2012-05-24T05:25:14.200

Based on the assumption that there are NO unquoted attribute values inside the scriptlets, the following construct might work for you:

Note: this approach is fragile and is offered for reference only.

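import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Test {
  public static void main(String[] args) {
    // Sample tag mixing quoted, unquoted, and scriptlet-valued attributes.
    String s = "<INPUT id=foo abbr='ip ' name =  bar color =\"blue\" value=\" <%= someBean.method(\" a = b \") %>\" nickname =box  >";
    // Group 1: attribute name. Group 2: word characters followed by one
    // character that is neither a quote nor whitespace.
    // Caveat: a one-character value such as size=1> is captured as "1>".
    Pattern p = Pattern.compile("(\\w+)\\s*=\\s*(\\w+[^\"'\\s])");
    Matcher m = p.matcher(s);
    while (m.find()) {
      System.out.println("Return Value:" + m.group(1) + "=" + m.group(2));
    }
  }
}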

Output:

Return Value:id=foo
Return Value:name=bar
Return Value:nickname=box
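
A possibly less fragile variation (a sketch, not part of the code above) is to let the regex match scriptlets and already-quoted values first and copy them through unchanged, so that only the last alternative ever rewrites anything. The class name AttributeQuoter and the method quoteAttributes are made up for illustration. It assumes that scriptlets do not nest, that "%>" never appears inside a scriptlet's own string literals, and that every "name=value" outside quotes and scriptlets really is an attribute.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: names are illustrative, not from any library.
public class AttributeQuoter {
    private static final Pattern TOKEN = Pattern.compile(
          "<%.*?%>"                            // scriptlet, expression or directive
        + "|\"(?:<%.*?%>|[^\"])*+\""           // double-quoted value, may embed scriptlets
        + "|'(?:<%.*?%>|[^'])*+'"              // single-quoted value, may embed scriptlets
        + "|(\\w+)\\s*=\\s*([^\\s\"'<>]+)",    // unquoted attribute: groups 1 and 2
        Pattern.DOTALL);                       // let scriptlets span several lines

    public static String quoteAttributes(String jsp) {
        Matcher m = TOKEN.matcher(jsp);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // Group 1 is set only when the unquoted-attribute alternative matched;
            // every other alternative is copied back exactly as found.
            String replacement = (m.group(1) == null)
                ? m.group()
                : m.group(1) + "=\"" + m.group(2) + "\"";
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(quoteAttributes(
            "<INPUT id=foo value=\"<%= someBean.method(\"a=b\") %>\">"));
    }
}
```

For the scriptlet example from the question this leaves the "a=b" literal alone and only rewrites id=foo to id="foo". The possessive quantifier (*+) is there to keep an unbalanced quote from causing heavy backtracking. An unquoted value that is itself a scriptlet (value=<%= x %>) is left as-is and would still need separate handling.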

score 0 · Accepted Answer · answered May 24 '12 at 05:46

If a sentence contains an arbitrary number of matching tokens such as double quotes, then this sentence belongs to a context-free language, which simply cannot be parsed with regular expressions, since those are designed to handle regular languages.

Either you make some simplifying assumptions (e.g. there are no unmatched double quotes, and there is only a certain number of them) that would permit the use of a regex, or you need to think about using (or creating) a lexer/parser for a context-free language. ANTLR is a good tool for this.
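
As a rough illustration of the "simplifying assumptions" route, here is a minimal hand-written scanner, not a real lexer and not ANTLR. It assumes that scriptlets never nest and that "%>" never occurs inside a scriptlet's string literals. It copies every <% ... %> region through verbatim and applies the question's own pattern and replacement only to the text in between. The class and method names are hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical helper: names are illustrative, not from any library.
public class ScriptletSkipper {
    // The pattern from the question, unchanged.
    private static final Pattern UNQUOTED = Pattern.compile("(\\w+)=([^\"' >]+)");

    public static String quoteOutsideScriptlets(String jsp) {
        StringBuilder out = new StringBuilder();
        int pos = 0;
        while (pos < jsp.length()) {
            int open = jsp.indexOf("<%", pos);
            if (open < 0) {
                open = jsp.length();           // no more scriptlets: rest is plain text
            }
            // Plain text before the next scriptlet: quote unquoted values here.
            String text = jsp.substring(pos, open);
            out.append(UNQUOTED.matcher(text).replaceAll("$1=\"$2\""));
            if (open == jsp.length()) {
                break;
            }
            // Scriptlet: copy verbatim up to and including the first "%>".
            int close = jsp.indexOf("%>", open + 2);
            int end = (close < 0) ? jsp.length() : close + 2;
            out.append(jsp, open, end);
            pos = end;
        }
        return out.toString();
    }
}
```

Because the question's pattern is reused as-is, this sketch keeps its other limitations: a "name=value" sequence inside an ordinary quoted value (for example title="a=b") would still be rewritten.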
