
I have a string that holds rich text, for example:

<p>We are pleased <a href="http://www.anc.com/content/cy-tech/global/en/cq5-reference-materials.s_cid_123.html">to present the new product type</a>. This new product type is the best thing since sliced bread. We are pleased to present the new product type. This new product <a href="mailto:abc@gmail.com">type is the best</a> thing since sliced bread.</p>

The text above is stored as a single string. I need to append certain parameters to the hrefs that meet certain criteria. How can I extract only the hrefs, append the parameter, and output the string otherwise undamaged? (FYI: the string is the value entered through an RTE, a rich text editor.)

Tried this approach but without success.

String tmpStr = "href=\"http://www.abc.com\">design";

StringBuffer tmpStrBuff = new StringBuffer();
String[] tmpStrSpt = tmpStr.split(">");
if (tmpStrSpt[0].contains("abc.com")) {
    String[] tmpStrSpt1 = tmpStrSpt[0].split("\"");
    tmpStrBuff.append(tmpStrSpt1[0]);
    if (tmpStrSpt1[1].contains("?")) {
        tmpStrBuff.append("\"" + tmpStrSpt1[1] + "&s_cid=abcd_xyz\">");
    } else {
        tmpStrBuff.append("\"" + tmpStrSpt1[1] + "?s_cid=abcd_xyz\">");
    }
    tmpStrBuff.append(tmpStrSpt[1]);
    tmpStrBuff.append("</a>");
    System.out.println(" <p>tmpStr1:::: " + tmpStrBuff.toString() + "</p>");
}

The other approach I tried is:

String[] tmpTxtArr = text.split("\\s+");
StringBuffer tmpStrBuff = new StringBuffer();
for (String tmpTxt : tmpTxtArr) {
    descTxt += (tmpTxt.contains("abc.com") && !tmpTxt.contains("?")) ? tmpTxt
            .replace("\">", "?s_cid=" + trackingCode + "\">" + " ")
            : tmpTxt + " ";
}
Qantas 94 Heavy

1 Answer

Description

This regex will:

  1. find the href attribute in anchor tags
  2. require the href to point at http://abc.com; https and a www. prefix are also allowed in their respective positions
  3. if the URL contains a ?, capture it into capture group 3

<a\b[^<]*\bhref=(['"])(https?:\/\/(?:www[.])?abc[.]com[^"'?]*?([?]?)[^"'?]*?)\1[^<]*<\/a>

(Image: state chart of the expression)

Groups

Group 0 will have the entire anchor from the open <a to the close </a>. If you find this to be excessive or that it collides with nested anchor tags, then simply remove the [^<]*<\/a> from the end of this expression.

  1. gets the open quote which is back referenced later at \1 to ensure we have the same close quote
  2. gets the href value
  3. if there was a question mark then it's captured here
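
The note above about dropping the trailing [^<]*<\/a> can be illustrated with a minimal sketch. The class name ShortenedPatternDemo and the helper countMatches are hypothetical names introduced only for this illustration; the nested sample string is the one discussed in the disclaimer further down. With the full expression only the inner anchor is found, while the shortened one should also find the outer anchor:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ShortenedPatternDemo {
    // The expression from the answer, without the trailing [^<]*<\/a>.
    static final String HEAD =
        "<a\\b[^<]*\\bhref=(['\"])(https?:\\/\\/(?:www[.])?abc[.]com[^\"'?]*?([?]?)[^\"'?]*?)\\1";

    // Full form: group 0 runs from <a to the closing </a>.
    static final Pattern FULL = Pattern.compile(HEAD + "[^<]*<\\/a>", Pattern.CASE_INSENSITIVE);
    // Shortened form: group 0 stops at the closing quote of the href.
    static final Pattern SHORT = Pattern.compile(HEAD, Pattern.CASE_INSENSITIVE);

    // Hypothetical helper: counts how many times the pattern matches.
    static int countMatches(Pattern p, String html) {
        Matcher m = p.matcher(html);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        String nested = "<a href=\"http://abc.com?attrib=value\">outside"
            + "<a href=\"http://abc.com?attrib=value2\">inside</a></a>";
        System.out.println("full:  " + countMatches(FULL, nested));
        System.out.println("short: " + countMatches(SHORT, nested));
    }
}
```

The capture groups are numbered the same in both forms, so code that reads groups 1 to 3 does not need to change.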

Java Code Example:

Given sample text:

<p>Some <a href="http://www.abc.com/content/cy-tech/global/en/cq5-reference-materials.s_cid_123.html">text</a>. I like kittens <a href="mailto:abc@gmail.com">email us</a>Dogs are nice.</p><a href="http://www.abc.com/content/cy-tech/global/en/cq5-reference-materials.s_cid_123.html?attribute=value">remember to vote</a>

This code

import java.util.regex.Matcher;
import java.util.regex.Pattern;

class Module1 {
  public static void main(String[] args) {
    // Placeholder: substitute the sample text given above.
    String sourcestring = "source string to match with pattern";
    Pattern re = Pattern.compile("<a\\b[^<]*\\bhref=(['\"])(https?:\\/\\/(?:www[.])?abc[.]com[^\"'?]*?([?]?)[^\"'?]*?)\\1[^<]*<\\/a>", Pattern.CASE_INSENSITIVE | Pattern.MULTILINE);
    Matcher m = re.matcher(sourcestring);
    int mIdx = 0;
    // Print every capture group (0 is the whole match) of every match.
    while (m.find()) {
      for (int groupIdx = 0; groupIdx <= m.groupCount(); groupIdx++) {
        System.out.println("[" + mIdx + "][" + groupIdx + "] = " + m.group(groupIdx));
      }
      mIdx++;
    }
  }
}

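Yields the following matches, shown here indexed by capture group first and match second (the Java code prints the same values as [match][group] = value lines):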

$matches Array:
(
    [0] => Array
        (
            [0] => <a href="http://www.abc.com/content/cy-tech/global/en/cq5-reference-materials.s_cid_123.html">text</a>
            [1] => <a href="http://www.abc.com/content/cy-tech/global/en/cq5-reference-materials.s_cid_123.html?attribute=value">remember to vote</a>
        )

    [1] => Array
        (
            [0] => "
            [1] => "
        )

    [2] => Array
        (
            [0] => http://www.abc.com/content/cy-tech/global/en/cq5-reference-materials.s_cid_123.html
            [1] => http://www.abc.com/content/cy-tech/global/en/cq5-reference-materials.s_cid_123.html?attribute=value
        )

    [3] => Array
        (
            [0] => 
            [1] => ?
        )

)

From here it's a simple matter of iterating through all the matches: if group 3 has a value, insert an & between the href value from group 2 and your new text; if not, insert a ?.
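
The step just described could be sketched roughly as follows. This is only an illustration under stated assumptions: the class name HrefParamAppender, the helper appendParam, and the shortened sample URLs are hypothetical, and the parameter s_cid=abcd_xyz is taken from the question. The sketch copies the source text up to the end of group 2, inserts ? or & depending on group 3, and then continues copying:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HrefParamAppender {
    // Same expression as above.
    private static final Pattern ANCHOR = Pattern.compile(
        "<a\\b[^<]*\\bhref=(['\"])(https?:\\/\\/(?:www[.])?abc[.]com[^\"'?]*?([?]?)[^\"'?]*?)\\1[^<]*<\\/a>",
        Pattern.CASE_INSENSITIVE);

    // Hypothetical helper: inserts param at the end of every matching href value.
    static String appendParam(String html, String param) {
        StringBuilder out = new StringBuilder();
        Matcher m = ANCHOR.matcher(html);
        int last = 0;
        while (m.find()) {
            // Copy everything up to the end of the href value (group 2).
            out.append(html, last, m.end(2));
            // Group 3 is "?" when the URL already has a query string, else empty.
            out.append(m.group(3).isEmpty() ? "?" : "&").append(param);
            last = m.end(2);
        }
        // Copy whatever follows the last matched href value.
        out.append(html, last, html.length());
        return out.toString();
    }

    public static void main(String[] args) {
        String html = "<p>Some <a href=\"http://www.abc.com/page.html\">text</a> and "
            + "<a href=\"mailto:abc@gmail.com\">email us</a> and "
            + "<a href=\"http://www.abc.com/page.html?attribute=value\">vote</a></p>";
        System.out.println(appendParam(html, "s_cid=abcd_xyz"));
    }
}
```

Because only the text at the end of group 2 is touched, everything outside the matching href values, including the mailto link, is copied through unchanged.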

Disclaimer

Parsing HTML with regex may not be the easiest thing to maintain in the long run. However, if you have control over your input text, the text stays fairly uncomplicated, and you're willing to accept the occasional edge case where a regular expression fails, then regex will work for you.

Some haters will point out that strings like the following will not match properly. Although true, in HTML these possibilities are either illegal or impractical and therefore are not likely to be encountered.

  • <a href="http://abc.com?attrib=</a>">link</a> for the special characters < / and > to work in HTML they need to be escaped. As shown here this would violate the HTML standard.
  • <a href="http://abc.com?attrib=value">outside<a href="http://abc.com?attrib=value2">inside</a></a> HTML does not permit nested anchors; a browser would have to choose which anchor tag is followed, and I've never seen this format used.
Ro Yo Mi
  • +1 for the disclaimer – jpaugh Jun 08 '13 at 16:12
  • How did you create that state chart, if I may ask? – qqilihq Jun 08 '13 at 19:14
  • http://stackoverflow.com/a/1732454/1294162 – wazy Jun 08 '13 at 19:18
  • @wazy, I think that comment should really be applied to the question and not a possible solution which also has a disclaimer supporting the idea behind your link. – Ro Yo Mi Jun 08 '13 at 20:22
  • @qqilihq, I'm using debuggex.com. Although it doesn't support lookbehinds or atomic groups it's still handy for understanding the expression flow. There is also regexper.com. They do a pretty good job too, but it's not real time as you're typing. – Ro Yo Mi Jun 08 '13 at 20:23