
I have a file that contains self-closing anchor tags:

    <p><a name="impact"/><span class="sectiontitle">Impact</span></p>
    <p><a name="Summary"/><span class="sectiontitle">Summary</span></p>

I want to correct the tags as shown below:

    <p><a name="impact"><span class="sectiontitle">Impact</span></a></p>
    <p><a name="Summary"><span class="sectiontitle">Summary</span></a></p>

I've written this code to find and replace the incorrect anchor tags:

    package mypack;

    import java.io.*;
    import java.util.regex.*;

    public class AnchorIssue {

        static int count = 0;

        public static void main(String[] args) throws IOException {
            Pattern pFinder = Pattern.compile("<a name=\\\".*\\\"(\\/)>(.*)(<)");
            BufferedReader r = new BufferedReader(new FileReader("D:/file.txt"));
            String line;
            while ((line = r.readLine()) != null) {
                Matcher m1 = pFinder.matcher(line);
                while (m1.find()) {
                    int start = m1.start(0);
                    int end = m1.end(0);
                    ++count;

                    // Use CharacterIterator.substring(offset, end);
                    String actual = line.substring(start, end);
                    System.out.println(count + "." + "Actual String :-" + actual);

                    actual.replace(m1.group(1), "");
                    System.out.println(actual);
                    actual.replaceAll(m1.group(3), "</a><");
                    System.out.println(actual);

                    // Use CharacterIterator.substring(offset, end);
                    System.out.println(count + "." + "Replaced" + actual);
                }
            }
            r.close();
        }
    }

The code above finds the correct number of self-closing anchor tags in the file, but the replacement code is not working properly.

skr
  • Hmm, there is a good answer about trying to parse HTML with regex in this post: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags – Adrien Brunelat Apr 18 '16 at 13:25
  • This is generally true. However, in some edge cases like this one, you can use regexes to help you edit a file. Also, some regex engines can deal with recursion. I know that's theoretically not regular anymore; still, they are regexes. This is my write-up on the topic: https://tamasrev.wordpress.com/2014/06/11/recursive-regular-expressions/ – Tamas Rev Apr 18 '16 at 15:40

3 Answers


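Your problem is greediness: the .* in .*" is greedy, so it consumes as much of the line as it can and gives characters back only as far as needed for the rest of the pattern to match. When a line contains more than one anchor, the match can run well past the closing " of the first name attribute. There are two fixes, and both replace this line: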

Pattern pFinder = Pattern.compile("<a name=\\\".*\\\"(\\/)>(.*)(<)");

Option one: use a negated character class:

Pattern pFinder = Pattern.compile("<a name=\\\"[^\\\"]*\\\"(\\/)>(.*)(<)");

Option two: use a lazy (reluctant) quantifier:

Pattern pFinder = Pattern.compile("<a name=\\\".*?\\\"(\\/)>(.*)(<)");

See more here.
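
To make the difference between the three quantifier styles concrete, here is a minimal sketch. The class name GreedyDemo, the helper firstMatch, and the two-anchor sample line are assumptions made for illustration; they are not part of the original code, and the patterns are shortened to the name attribute only.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyDemo {
    // Hypothetical helper: returns the first match of regex in input, or null if there is none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Assumed sample line containing two self-closing anchors.
        String line = "<a name=\"x\"/><a name=\"y\"/>";

        // Greedy: .* runs to the last "/> on the line, so both anchors end up in one match.
        System.out.println(firstMatch("<a name=\".*\"/>", line));

        // Negated character class: [^"]* cannot cross a quote, so the match stops at the first anchor.
        System.out.println(firstMatch("<a name=\"[^\"]*\"/>", line));

        // Lazy quantifier: .*? grows only as far as needed, so the match also stops at the first anchor.
        System.out.println(firstMatch("<a name=\".*?\"/>", line));
    }
}
```

The negated character class is usually the stricter of the two fixes, because it can never cross a quote, whereas the lazy form can still do so if the shorter candidates fail to match.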

Tamas Rev

Since the file structure seems "constant", it might be better to simplify the problem to a matter of simple replaces as opposed to complex html matching. It seems to me that you're not really interested in the content of the anchor tag, so just replace /><span with ><span and </span></p> with </span></a></p>.
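
The two replacements described above could be sketched roughly as follows. The class name SimpleReplace and the method fixLine are hypothetical, and the sketch assumes that every self-closing anchor is immediately followed by a span that closes right before </p>, as in the sample lines from the question.

```java
class SimpleReplace {
    // Hypothetical helper: applies the two literal replacements to a single line.
    // String.replace treats both arguments as plain text, not as regular expressions.
    static String fixLine(String line) {
        return line.replace("/><span", "><span")
                   .replace("</span></p>", "</span></a></p>");
    }

    public static void main(String[] args) {
        String line = "<p><a name=\"impact\"/><span class=\"sectiontitle\">Impact</span></p>";
        System.out.println(fixLine(line));
    }
}
```

Java strings are immutable, so replace returns a new string and leaves the original untouched. The result has to be assigned or returned, as fixLine does; a call whose return value is ignored changes nothing.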

CptBartender

Using the code below, I am able to find and replace all self-closed anchor tags.

    package mypack;

    import java.io.*;
    import java.util.regex.*;

    public class AnchorIssue {

        static int count = 0;

        public static void main(String[] args) throws IOException {
            Pattern pFinder = Pattern.compile("<a name=\\\".*?\\\"(\\/><span)(.*)(<\\/span>)");
            BufferedReader r = new BufferedReader(new FileReader("file.txt"));
            String line;
            while ((line = r.readLine()) != null) {
                Matcher m1 = pFinder.matcher(line);
                while (m1.find()) {
                    int start = m1.start(0);
                    int end = m1.end(0);
                    ++count;

                    // Use CharacterIterator.substring(offset, end);
                    String actual = line.substring(start, end);
                    System.out.println(count + "." + "Actual String : " + actual);

                    actual = actual.replaceAll(m1.group(1), "><span");
                    System.out.println("\n");

                    actual = actual.replaceAll(m1.group(3), "</span></a>");

                    System.out.println(count + "." + "Replaced : " + actual);
                    System.out.println("\n");
                    System.out.println("---------------------------------------------------");
                }
            }
            r.close();
        }
    }
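
As a possible alternative to the find-then-replace loop above, the same rewrite can be done in one pass with capturing groups and backreferences in the replacement string. This is a different technique from the code above, shown only as a sketch: the class name AnchorFixer and the method fixLine are hypothetical, and the pattern assumes the span directly follows the anchor and contains no nested span.

```java
import java.util.regex.Pattern;

class AnchorFixer {
    // Group 1: the value of the name attribute. Group 2: the whole span element after the anchor.
    private static final Pattern SELF_CLOSED =
            Pattern.compile("<a name=\"([^\"]*)\"/>(<span\\b.*?</span>)");

    // Hypothetical helper: rewrites every self-closed anchor on the line so that it wraps its span.
    static String fixLine(String line) {
        // $1 and $2 in the replacement refer to the captured groups.
        return SELF_CLOSED.matcher(line).replaceAll("<a name=\"$1\">$2</a>");
    }

    public static void main(String[] args) {
        String line = "<p><a name=\"Summary\"/><span class=\"sectiontitle\">Summary</span></p>";
        System.out.println(fixLine(line));
    }
}
```

Because fixLine returns the corrected line, the result could be written to an output file line by line instead of only being printed.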
skr