String manipulation - Rich text editor

Question

I have a string that holds HTML, for example:

<p>We are pleased <a href="http://www.anc.com/content/cy-tech/global/en/cq5-reference-materials.s_cid_123.html">to present the new product type</a>. This new product type is the best thing since sliced bread. We are pleased to present the new product type. This new product <a href="mailto:abc@gmail.com">type is the best</a> thing since sliced bread.</p>

The above text is stored as a single string value. I need to append certain parameters to the hrefs that meet certain criteria. How can I extract only the href values, append the parameter, and output the rest of the string undamaged? (FYI: the string is the value entered through an RTE, a rich text editor.)

I tried this approach, without success:

String tmpStr = "href=\"http://www.abc.com\">design";

StringBuffer tmpStrBuff = new StringBuffer();
String[] tmpStrSpt = tmpStr.split(">");
if (tmpStrSpt[0].contains("abc.com")) {
    String[] tmpStrSpt1 = tmpStrSpt[0].split("\"");
    tmpStrBuff.append(tmpStrSpt1[0]);
    if (tmpStrSpt1[1].contains("?")) {
        tmpStrBuff.append("\"" + tmpStrSpt1[1] + "&s_cid=abcd_xyz\">");
    } else {
        tmpStrBuff.append("\"" + tmpStrSpt1[1] + "?s_cid=abcd_xyz\">");
    }
    tmpStrBuff.append(tmpStrSpt[1]);
    tmpStrBuff.append("</a>");
    System.out.println(" <p>tmpStr1:::: " + tmpStrBuff.toString() + "</p>");
}

The other approach I tried:

String[] tmpTxtArr = text.split("\\s+");
StringBuffer tmpStrBuff = new StringBuffer();
for (String tmpTxt : tmpTxtArr) {
    descTxt += (tmpTxt.contains("abc.com") && !tmpTxt.contains("?")) ? tmpTxt
            .replace("\">", "?s_cid=" + trackingCode + "\">" + " ")
            : tmpTxt + " ";
}

Well at least you should discard regex. Parsing markup language with regex is not a very good idea. Also why Javascript? — Mena, Jun 08 '13 at 13:02
@user1661908 Add both approaches in question and remove comments with them. This will make it easier to read and all users will be able to read them in question (not everyone is reading comments). — Pshemo, Jun 08 '13 at 13:17

score 2 · Answer 1 · edited Jun 20 '20 at 09:12

Description

This regex will:

find the href attribute in anchor tags
require the href value to start with http://abc.com; https and a www. prefix (www.abc.com) are also allowed in their respective positions
if the href value contains a ?, capture it in capture group 3

<a\b[^<]*\bhref=(['"])(https?:\/\/(?:www[.])?abc[.]com[^"'?]*?([?]?)[^"'?]*?)\1[^<]*<\/a>
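
As a quick sanity check of the points above, here is a minimal sketch that runs the expression against a few anchors. The class name HrefPatternCheck, the helper matches, and the sample anchors are made up for illustration; only the expression itself comes from this answer.

```java
import java.util.regex.Pattern;

class HrefPatternCheck {
    // Same expression as above, escaped for a Java string literal
    static final Pattern ANCHOR = Pattern.compile(
        "<a\\b[^<]*\\bhref=(['\"])(https?:\\/\\/(?:www[.])?abc[.]com[^\"'?]*?([?]?)[^\"'?]*?)\\1[^<]*<\\/a>",
        Pattern.CASE_INSENSITIVE);

    // True if the text contains at least one anchor the expression accepts
    static boolean matches(String html) {
        return ANCHOR.matcher(html).find();
    }

    public static void main(String[] args) {
        String[] samples = {
            "<a href=\"http://abc.com/x\">t</a>",             // plain http
            "<a href=\"https://www.abc.com/x?a=b\">t</a>",    // https, www and a query string
            "<a href='http://abc.com/x'>t</a>",               // single-quoted value
            "<a href=\"mailto:abc@gmail.com\">t</a>",         // not http(s), should be skipped
            "<a href=\"http://example.org/\">t</a>"           // different host, should be skipped
        };
        for (String s : samples) {
            System.out.println(matches(s) + "  " + s);
        }
    }
}
```

The first three samples should be accepted and the last two rejected, which is what lets the mailto link in the question pass through untouched.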


Groups

Group 0 holds the entire anchor, from the opening <a to the closing </a>. If you find this excessive, or it collides with nested anchor tags, simply remove the [^<]*<\/a> from the end of the expression.
Group 1 gets the opening quote, which is back-referenced later as \1 to ensure the closing quote is the same
Group 2 gets the href value
Group 3 gets the question mark, if the href value contains one

Java Code Example:

Given sample text:

<p>Some <a href="http://www.abc.com/content/cy-tech/global/en/cq5-reference-materials.s_cid_123.html">text</a>. I like kittens <a href="mailto:abc@gmail.com">email us</a>Dogs are nice.</p><a href="http://www.abc.com/content/cy-tech/global/en/cq5-reference-materials.s_cid_123.html?attribute=value">remember to vote</a>

This code

import java.util.regex.Pattern;
import java.util.regex.Matcher;

class Module1 {
  public static void main(String[] asd) {
    // Placeholder: substitute the sample text given above
    String sourcestring = "source string to match with pattern";
    Pattern re = Pattern.compile("<a\\b[^<]*\\bhref=(['\"])(https?:\\/\\/(?:www[.])?abc[.]com[^\"'?]*?([?]?)[^\"'?]*?)\\1[^<]*<\\/a>", Pattern.CASE_INSENSITIVE | Pattern.MULTILINE);
    Matcher m = re.matcher(sourcestring);
    int mIdx = 0;
    // Print every capture group of every match as [match][group] = value
    while (m.find()) {
      for (int groupIdx = 0; groupIdx <= m.groupCount(); groupIdx++) {
        System.out.println("[" + mIdx + "][" + groupIdx + "] = " + m.group(groupIdx));
      }
      mIdx++;
    }
  }
}

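Yields the following matches (shown as an array dump indexed [group][match]; the Java code prints the same values one per line, indexed [match][group]):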

$matches Array:
(
    [0] => Array
        (
            [0] => <a href="http://www.abc.com/content/cy-tech/global/en/cq5-reference-materials.s_cid_123.html">text</a>
            [1] => <a href="http://www.abc.com/content/cy-tech/global/en/cq5-reference-materials.s_cid_123.html?attribute=value">remember to vote</a>
        )

    [1] => Array
        (
            [0] => "
            [1] => "
        )

    [2] => Array
        (
            [0] => http://www.abc.com/content/cy-tech/global/en/cq5-reference-materials.s_cid_123.html
            [1] => http://www.abc.com/content/cy-tech/global/en/cq5-reference-materials.s_cid_123.html?attribute=value
        )

    [3] => Array
        (
            [0] => 
            [1] => ?
        )

)

From here it's a simple matter of iterating through all the matches: if group 3 has a value, insert a & between the href value from group 2 and your new text; if not, insert a ?.
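
That last step could be sketched roughly as follows. This is only one way to do it: the class name HrefTracking and the helper appendParam are hypothetical, and the s_cid=abcd_xyz value is taken from the question. The sketch assumes the hrefs carry no #fragment and, like the question's own attempt, appends a bare & rather than &amp;.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HrefTracking {
    // The expression from this answer, escaped for a Java string literal
    private static final Pattern ANCHOR = Pattern.compile(
        "<a\\b[^<]*\\bhref=(['\"])(https?:\\/\\/(?:www[.])?abc[.]com[^\"'?]*?([?]?)[^\"'?]*?)\\1[^<]*<\\/a>",
        Pattern.CASE_INSENSITIVE);

    // Appends param (for example "s_cid=abcd_xyz") to every matching href value
    static String appendParam(String html, String param) {
        Matcher m = ANCHOR.matcher(html);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            // Copy everything up to the end of the href value (group 2)
            out.append(html, last, m.end(2));
            // Group 3 is "?" when the URL already has a query string, else empty
            out.append(m.group(3).isEmpty() ? "?" : "&").append(param);
            last = m.end(2);
        }
        // Copy whatever follows the last rewritten href
        out.append(html, last, html.length());
        return out.toString();
    }

    public static void main(String[] args) {
        String html = "<p>Some <a href=\"http://www.abc.com/page.html\">text</a> and "
            + "<a href=\"mailto:abc@gmail.com\">email us</a> and "
            + "<a href=\"http://www.abc.com/page.html?attribute=value\">vote</a></p>";
        System.out.println(appendParam(html, "s_cid=abcd_xyz"));
    }
}
```

Because the text between matches is copied verbatim, anything the expression does not match (such as the mailto link) comes out unchanged.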

Disclaimer

Parsing HTML with regex may not be the easiest thing to maintain in the long run. However, if you have control over your input text, the text remains fairly uncomplicated, and you're willing to accept the occasional edge case where a regular expression fails, then regex will work for you.

Some haters will point out that strings like the following will not match properly. Although true, in HTML these possibilities are either illegal or impractical and therefore are not likely to be encountered.

<a href="http://abc.com?attrib=</a>">link</a> : the special symbols <, / and > need to be escaped to work in HTML. As shown here, this would violate the HTML standard.
<a href="http://abc.com?attrib=value">outside<a href="http://abc.com?attrib=value2">inside</a></a> : the nested link may be legal, but it forces the browser to choose which anchor tag is followed, and I've never seen this format used.

@ wazy, I think that comment should really be applied to the question and not a possible solution which also has a disclaimer supporting the idea behind your link. — Ro Yo Mi, Jun 08 '13 at 20:22
@ qqilihq, I'm using debuggex.com. Although it doesn't support lookbehinds or atomic groups it's still handy for understanding the expression flow. There is also regexper.com. They do a pretty good job too, but it's not real time as you're typing. — Ro Yo Mi, Jun 08 '13 at 20:23
