I have a bunch of HTML files. In these files I need to correct the src
attribute of the IMG tags.
The IMG tags look typically like this:
<img alt="" src="./Suitbert_files/233px-Suitbertus.jpg" class="thumbimage" height="243" width="233" />`
where the attributes are NOT in any specific order.
I need to remove the dot and the forward slash at the beginning of the src
attribute of the IMG tags so they look like this:
<img alt="" src="Suitbert%20%E2%80%93%20Wikipedia_files/233px-Suitbertus.jpg" class="thumbimage" height="243" width="233" />
I have the following class so far:
import java.util.regex.*;
public class Replacer {
// this PATTERN should find all img tags with 0 or more attributes before the src-attribute
private static final String PATTERN = "<img\\.*\\ssrc=\"\\./";
private static final String REPLACEMENT = "<img\\.*\\ssrc=\"";
private static final Pattern COMPILED_PATTERN = Pattern.compile(PATTERN, Pattern.CASE_INSENSITIVE);
public static void findMatches(String html){
Matcher matcher = COMPILED_PATTERN.matcher(html);
// Check all occurance
System.out.println("------------------------");
System.out.println("Following Matches found:");
while (matcher.find()) {
System.out.print("Start index: " + matcher.start());
System.out.print(" End index: " + matcher.end() + " ");
System.out.println(matcher.group());
}
System.out.println("------------------------");
}
public static String replaceMatches(String html){
//Pattern replace = Pattern.compile("\\s+");
Matcher matcher = COMPILED_PATTERN.matcher(html);
html = matcher.replaceAll(REPLACEMENT);
return html;
}
}
So, my method findMatches(String html)
seems to find correctly all IMG tags where the src
attributes starts with ./
.
Now my method replaceMatches(String html)
does not correctly replace the matches.
I am a newbie to regex, but I assume that either the REPLACEMENT regex is incorrect or the usage of the replaceAll method or both.
A you can see, the replacement String contains 2 parts which are identical in all IMG tags:
<img
and src="./
. In between these 2 parts, there should be the 0 or more HTML attributes from the original string.
How do I formulate such a REPLACEMENT string?
Can somebody please enlighten me?