1

Problem:

I have a servlet that generates reports, more specifically the table body of a report. It is a black box; we do not have access to the source code.

Nevertheless, it works satisfactorily, and there are no plans to rewrite or replace the servlet anytime soon.

We need to modify its response text to update a few links it generates to other reports. I was thinking of doing it with a filter that would find the anchor text and replace it using a regex.
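Roughly, the idea is to let the servlet write into a buffer and then run a find/replace over the buffered text before it goes out. Here is a minimal sketch of that capture-and-rewrite step in plain Java, without the servlet API. Everything in it is an assumption for illustration: `BLACK_BOX` is a hypothetical stand-in for the servlet, and the `old/report` and `new/report.xhtml` URLs are placeholders. In a real filter, the buffer would sit behind an `HttpServletResponseWrapper` whose `getWriter()` returns the buffering writer.

```java
import java.io.CharArrayWriter;
import java.io.PrintWriter;
import java.util.function.Consumer;
import java.util.regex.Pattern;

public class BufferedRewriteSketch {
    // Hypothetical stand-in for the black-box servlet: it only sees a PrintWriter.
    public static final Consumer<PrintWriter> BLACK_BOX = out ->
        out.print("<tr><td><a href=\"old/report?id=R1\">R1</a></td></tr>");

    // Lets the producer write into an in-memory buffer, then rewrites the buffered text.
    public static String captureAndRewrite(Consumer<PrintWriter> producer,
                                           Pattern search, String replacement) {
        CharArrayWriter buffer = new CharArrayWriter();
        PrintWriter writer = new PrintWriter(buffer);
        producer.accept(writer);
        writer.flush(); // make sure everything written so far reached the buffer
        return search.matcher(buffer.toString()).replaceAll(replacement);
    }

    public static void main(String[] args) {
        // Placeholder pattern: capture the id value up to the next quote or ampersand.
        Pattern search = Pattern.compile("href=\"old/report\\?id=([^\"&]+)\"");
        String result = captureAndRewrite(BLACK_BOX, search,
            "href=\"new/report.xhtml?id=$1\" target=\"_parent\"");
        System.out.println(result);
    }
}
```

The whole response is held in memory before the replace runs, which seems acceptable for a table body but would need rethinking for very large outputs.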

Research:

I ran into this question that has a regex filter. It should be what I need, but then maybe not.

I am not trying to parse HTML in the strict sense of the term, and I am not working with the full spec of the language. What I have is a subset of HTML tags that compose a table body, with no nested tables, so the HTML subset generated by the servlet is not recursive.

I just need to find and replace the anchors' targets and add an attribute to the tag.

So the question is:

I need to modify the output of a servlet in order to change all links of the kind:

<a href="http://mypage.com/servlets/reports/?a=report&id=MyReport&filters=abcdefg">

into links like:

<a href="http://myOtherPage.com/webReports/report.xhtml?id=MyReport&filters=abcdefg" target="_parent">

Should I use the regex filter written by @Jeremy Stein, or is there a better solution?

Mindwin Remember Monica
  • I certainly wouldn't use regex for parsing HTML, but perhaps something like this would work for the URL itself. For humour and dire warning you should read this: http://stackoverflow.com/a/1732454/650425 – maple_shaft Oct 24 '12 at 11:16
  • @maple_shaft lots of lols from the team over my shoulder on that question. As for the servlet output, we tested and it writes valid XML. If I needed to convert it into a data structure I would parse it using a XML parser. I just need to modify sections of it before it is sent in the response. – Mindwin Remember Monica Oct 24 '12 at 12:00
  • When you say: _"links of the kind..."_ do you mean all links to a specific host or domain? Or just those having that specific URL? Or only those having path=`servlets/reports/`? You need to be a bit more explicit on precisely which anchor links you wish to modify. Also, will the anchors have any other attributes? – ridgerunner Oct 24 '12 at 14:07
  • Thanks ridge, I went down the regex path anyway. I was not parsing the HTML, just doing a find-replace on the output. It runs fast and does the job. – Mindwin Remember Monica Oct 24 '12 at 16:25
  • Caveat for those arriving here: Avoid the Lazy Dot .*? like the plague if you can. http://www.regular-expressions.info/catastrophic.html – Mindwin Remember Monica Feb 21 '14 at 15:12

2 Answers

1

Assuming that the only part of the target A tags which varies is the query component of the href attribute, this tested regex solution should do a pretty good job:

// TEST.java 20121024_0800
import java.util.regex.*;
public class TEST {
    // Compiled once and reused; Pattern instances are immutable and thread-safe.
    private static final Pattern RE_REPORT_ANCHOR = Pattern.compile(
        "<a href=\"http://mypage\\.com/servlets/reports/\\?a=report&id=([^\"]+)\">",
        Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);

    // Rewrites every report anchor; $1 carries over the id and any remaining query parameters.
    public static String fixReportAnchorElements(String text) {
        Matcher m = RE_REPORT_ANCHOR.matcher(text);
        return m.replaceAll(
            "<a href=\"http://myOtherPage.com/webReports/report.xhtml?id=$1\" target=\"_parent\">"
            );
    }
    public static void main(String[] args) {
        String input =
            "test <a href=\"http://mypage.com/servlets/reports/?a=report&id=MyReport&filters=abcdefg\"> test";
        String output = fixReportAnchorElements(input);
        System.out.println(output);
    }
}
ridgerunner
  • Yes, that was what I had in mind. I used Jeremy Stein's class (see link in question), with some changes. I just wanted to know if the regex filter was the way to go for editing the servlet output. I added my pattern and replaceString to your answer, +1 and accepted it. – Mindwin Remember Monica Oct 24 '12 at 16:20
0

I used Jeremy Stein's classes (linked in the question), with a few changes:

a) Make sure that nothing down the filter chain, nor the servlet itself, calls getOutputStream() on the wrapper object, or it will throw an IllegalStateException (check this answer by BalusC on the subject).

b) I wanted to make a single change on the page, so I did not put any filterConfig in web.xml.

b.2) In fact, I did not put anything in web.xml at all. I used the javax.servlet.annotation.WebFilter annotation on the class itself.

c) I set the Pattern and replace strings directly on the class:

Pattern searchPattern = Pattern.compile("<a (.*?) href=\".*?id=(.*?)(?:&amp;|&)filtros=(.*?)\" (.*?)>(.*?)</a>");
String replaceString = "<a $1 href=\"/webReports/report.xhtml?idRel=$2&filtros=$3\" target=\"_parent\" $4>$5</a>";

Note the lazy .*? quantifiers, which match as little as possible to avoid matching more than intended.
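A lazy dot still expands as far as it has to in order to make the whole pattern match, so it can run past the end of a tag when an earlier anchor lacks the filtros parameter. A tighter variant replaces each `.*?` with a negated character class that cannot cross a quote or a `>`. The following is only a sketch under my own assumptions: the input string is made up, the anchors always carry one attribute before and one after href, and the link text contains no nested tags (hence `[^<]*`).

```java
import java.util.regex.Pattern;

public class AnchorRewriteSketch {
    // Lazy-dot version of the search pattern.
    public static final Pattern LAZY = Pattern.compile(
        "<a (.*?) href=\".*?id=(.*?)(?:&amp;|&)filtros=(.*?)\" (.*?)>(.*?)</a>");

    // Same structure, but each group is confined to its own attribute or tag.
    public static final Pattern STRICT = Pattern.compile(
        "<a ([^>]*?) href=\"[^\"]*?id=([^\"&]*)(?:&amp;|&)filtros=([^\"]*)\" ([^>]*)>([^<]*)</a>");

    public static final String REPLACEMENT =
        "<a $1 href=\"/webReports/report.xhtml?idRel=$2&filtros=$3\" target=\"_parent\" $4>$5</a>";

    public static String rewrite(Pattern pattern, String html) {
        return pattern.matcher(html).replaceAll(REPLACEMENT);
    }

    public static void main(String[] args) {
        // Made-up input: the first anchor has no filtros parameter and should be left alone.
        String html =
            "<a class=\"x\" href=\"/other?id=1\" title=\"t\">A</a> "
          + "<a class=\"y\" href=\"/servlets/reports/?a=report&id=MyReport&filtros=abc\" title=\"u\">B</a>";
        System.out.println("lazy:   " + rewrite(LAZY, html));
        System.out.println("strict: " + rewrite(STRICT, html));
    }
}
```

With this input, the lazy pattern starts matching at the first anchor and keeps expanding into the second one to find filtros, so both anchors end up mangled. The strict pattern skips the first anchor and rewrites only the second.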

For testing the matching and the regex, I used this applet I found while researching the subject.

Hope this helps anyone with the same problem.

Mindwin Remember Monica