3

I have a bunch of HTML files. In these files I need to correct the src attribute of the IMG tags. The IMG tags typically look like this:

<img alt="" src="./Suitbert_files/233px-Suitbertus.jpg" class="thumbimage" height="243" width="233" />

where the attributes are NOT in any specific order. I need to remove the dot and the forward slash at the beginning of the src attribute of the IMG tags so they look like this:

<img alt="" src="Suitbert%20%E2%80%93%20Wikipedia_files/233px-Suitbertus.jpg" class="thumbimage" height="243" width="233" />

I have the following class so far:

import java.util.regex.*;


public class Replacer {

    // this PATTERN should find all img tags with 0 or more attributes before the src-attribute
    private static final String PATTERN = "<img\\.*\\ssrc=\"\\./";
    private static final String REPLACEMENT = "<img\\.*\\ssrc=\"";
    private static final Pattern COMPILED_PATTERN = Pattern.compile(PATTERN,  Pattern.CASE_INSENSITIVE);


    public static void findMatches(String html){
        Matcher matcher = COMPILED_PATTERN.matcher(html);
        // Check all occurrences
        System.out.println("------------------------");
        System.out.println("Following Matches found:");
        while (matcher.find()) {
            System.out.print("Start index: " + matcher.start());
            System.out.print(" End index: " + matcher.end() + " ");
            System.out.println(matcher.group());
        }
        System.out.println("------------------------");
    }

    public static String replaceMatches(String html){
        //Pattern replace = Pattern.compile("\\s+");
        Matcher matcher = COMPILED_PATTERN.matcher(html);
        html = matcher.replaceAll(REPLACEMENT);
        return html;
    }
}

So, my method findMatches(String html) seems to correctly find all IMG tags where the src attribute starts with ./.

However, my method replaceMatches(String html) does not replace the matches correctly. I am a newbie to regex, but I assume that either the REPLACEMENT regex is incorrect, or my usage of the replaceAll method is, or both. As you can see, the replacement string contains 2 parts which are identical in all IMG tags: <img and src="./. In between these 2 parts, there should be the 0 or more HTML attributes from the original string. How do I formulate such a REPLACEMENT string? Can somebody please enlighten me?

nhahtdh
mrd
  • are you calling the `replaceMatches()` method? – driangle Feb 14 '12 at 22:36
  • 1
    Why not do this with javascript? It would be pretty simple to iterate through the img collection and then remove the ./ from the beginning of each .src if it was there. – Travis J Feb 14 '12 at 22:39
  • Why would you do this in Java, and not using, say, sed, or an IDE/editor that does search/replace across files? Right tool for the job, and this is not something that makes sense to do in Java. – Dave Newton Feb 14 '12 at 22:46
  • 1
    @TravisJ Because doing it in JavaScript is working around the problem instead of fixing it. – Dave Newton Feb 14 '12 at 22:47
  • @Dave: if I do this in Eclipse, I will still need to know the correct REPLACEMENT regex. – mrd Feb 14 '12 at 23:41
  • @Travis: I need to have the html files corrected as files at development time. Not later, when they are opened in a browser. (In my case an Android WebView, and support for any manipulation of this kind is buggy and incomplete in the lower API levels). – mrd Feb 14 '12 at 23:46
  • @mradlmaier (Dave too) - I believe that I did not explain enough. What I meant was to use a parser in Java for Javascript (usable in Eclipse or IDE). This one is free: http://lobobrowser.org/cobra/java-html-parser.jsp – Travis J Feb 14 '12 at 23:55
  • @Travis J, Bozho: My little program will only be used by me to save me a lot of work, therefore using any parser seems to be overkill. My little program will never be released to the public. – mrd Feb 15 '12 at 00:05

4 Answers

7

Don't use regex for HTML. Use a parser, obtain the src attribute and replace it.

Bozho
  • @ggreiner: yes I do, from a different class like Replacer.replacesMatches(html) – mrd Feb 14 '12 at 23:48
  • I should add: when I examine the html output files, the replaced tags look like this: – mrd Feb 14 '12 at 23:51
  • As you can see, they are completely messed up, so the replacement takes place, but incorrectly – mrd Feb 14 '12 at 23:52
  • IMO if you're simply searching for a pretty specific thing, and it's pretty controlled like this, a regex is fine. In this case it'd be the first thing I'd try. That said, I already have directory-based XML-like search/replace tools, so if it didn't succeed essentially immediately, I'd use those. – Dave Newton Feb 14 '12 at 23:54
5

Try these:

PATTERN = "(<img[^>]*\\ssrc=\")\\./"
REPLACEMENT = "$1"

Basically, you capture everything except the ./ in group #1, then plug it back in using the $1 placeholder, effectively stripping off the ./.
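As a minimal sketch of that capture-and-reinsert idea: the class and method names below (SrcPrefixStripper, stripDotSlash) are made up for illustration, and the sample tag is shortened from the question. Only the pattern and the "$1" replacement come from this answer.

```java
import java.util.regex.Pattern;

public class SrcPrefixStripper {
    // Group 1 captures "<img", any attributes before src, and src's opening quote.
    // The "./" sits outside the group, so it is matched but not captured.
    private static final Pattern IMG_SRC = Pattern.compile(
            "(<img[^>]*\\ssrc=\")\\./", Pattern.CASE_INSENSITIVE);

    public static String stripDotSlash(String html) {
        // "$1" puts the captured text back; the uncaptured "./" is dropped.
        return IMG_SRC.matcher(html).replaceAll("$1");
    }

    public static void main(String[] args) {
        String html = "<img alt=\"\" src=\"./Suitbert_files/233px-Suitbertus.jpg\" class=\"thumbimage\" />";
        System.out.println(stripDotSlash(html));
        // prints: <img alt="" src="Suitbert_files/233px-Suitbertus.jpg" class="thumbimage" />
    }
}
```

Tags whose src does not start with ./ are left untouched, because the pattern simply does not match them.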

Notice how I changed your .* to [^>]*, too. If there happened to be two IMG tags on the same line, like this:

<img src="good" /><img src="./bad" />

...your regex would match this:

<img src="good" /><img src="./

It would do that even if you used a non-greedy .*?. [^>]* makes sure the match is always contained within the one tag.
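The difference between the three quantifiers can be checked with a small sketch like the one below. The helper firstMatch and the class name are hypothetical; the input line is the two-tag example from above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class QuantifierComparison {
    // Returns the text of the first match of regex in input, or null if there is none.
    public static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String line = "<img src=\"good\" /><img src=\"./bad\" />";

        // Greedy .* : the match starts at the first tag and runs into the second.
        System.out.println(firstMatch("<img.*\\ssrc=\"\\./", line));
        // Non-greedy .*? : same result, because the match still starts at the first "<img".
        System.out.println(firstMatch("<img.*?\\ssrc=\"\\./", line));
        // [^>]* : cannot cross the ">" of the first tag, so only the second tag matches.
        System.out.println(firstMatch("<img[^>]*\\ssrc=\"\\./", line));
    }
}
```

The first two calls print `<img src="good" /><img src="./`, while the third prints `<img src="./`.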

Alan Moore
  • Great, thx, this does the trick. And finally I understand how this thingie with $-sign and its use in the REPLACEMENT string works. – mrd Feb 15 '12 at 14:47
  • So this is the final solution, kudos to Alan Moore and Guillaume Polet: `private static final String PATTERN = "(<img[^>]*\\ssrc=\")\\./"; private static final String REPLACEMENT = "$1";` – mrd Feb 15 '12 at 14:53
1

Your replacement is incorrect. replaceAll substitutes the replacement string for each match, and the replacement is not interpreted as a regexp. To achieve what you want, you need to use groups. A group is delimited by parentheses in the regexp; each opening parenthesis starts a new group. You can use $i in the replacement string to reproduce what a group has matched, where i is the group number. See the documentation of appendReplacement for the details.

// Here is an example (it looks a bit like your case but not exactly)
String input = "<img name=\"foobar\" src=\"img.png\">";
String regexp = "<img(.+)src=\"[^\"]+\"(.*)>";
Matcher m = Pattern.compile(regexp).matcher(input);
StringBuffer sb = new StringBuffer();
while (m.find()) {
    // Found a match!
    // Append all chars before the match, then replace the match with the
    // replacement (the replacement refers to groups 1 and 2 with $1 and $2,
    // which match, respectively, everything between '<img' and 'src' and
    // everything after the src value up to the closing '>').
    m.appendReplacement(sb, "<img$1src=\"something else\"$2>");
}
m.appendTail(sb); // No more matches; append the rest of the input
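To illustrate the point that the replacement is not interpreted as a regexp, the sketch below runs the PATTERN and REPLACEMENT exactly as posted in the question. The class and method names are hypothetical, and the input tag is a shortened stand-in. In a replacement string a backslash only escapes the next character, so `\.` becomes a plain dot and `\s` becomes a plain `s`, which would explain the messed-up output.

```java
import java.util.regex.Pattern;

public class LiteralReplacementDemo {
    // Both strings are copied unchanged from the question.
    private static final String PATTERN = "<img\\.*\\ssrc=\"\\./";
    private static final String REPLACEMENT = "<img\\.*\\ssrc=\"";

    public static String replaceAsInQuestion(String html) {
        // The replacement is inserted as text: only "\" (escape) and "$" (group
        // reference) are special in it, so ".*" and "s" end up in the output.
        return Pattern.compile(PATTERN, Pattern.CASE_INSENSITIVE)
                .matcher(html)
                .replaceAll(REPLACEMENT);
    }

    public static void main(String[] args) {
        System.out.println(replaceAsInQuestion("<img src=\"./pic.jpg\">"));
        // prints: <img.*ssrc="pic.jpg">
    }
}
```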

Hope this helps you

Guillaume Polet
  • I already suspected that and looked at appendReplacement. But I am confused about how to do that. Any link to an example or tutorial would be helpful – mrd Feb 14 '12 at 23:58
  • There's no need to resort to `appendReplacement()` and `appendTail()` here (though it's certainly good to know about them). `replaceAll()` is perfectly capable of handling this job, as I demonstrated in my answer. – Alan Moore Feb 15 '12 at 07:55
  • Yes, I was only providing an example to the previous comment. – Guillaume Polet Feb 15 '12 at 08:29
  • @GuillaumePolet: Thx, your post and Alan's above enlightened me and solved the problem. Very interesting and exactly what I was looking for – mrd Feb 15 '12 at 14:50
0

If src attributes only occur in your HTML within img tags, you can just do this:

input.replace("src=\"./", "src=\"")

You could also do this without Java by using sed if you're on a *nix OS.
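The precondition on src matters, as a quick sketch shows. The class name, method names and sample markup below are assumptions for illustration. The plain replace rewrites every src="./ it finds, while the capture-group pattern posted elsewhere in this thread only touches img tags.

```java
import java.util.regex.Pattern;

public class PlainReplaceCaveat {
    // Literal replacement from this answer: affects any tag with src="./...".
    public static String plainReplace(String html) {
        return html.replace("src=\"./", "src=\"");
    }

    // Regex variant restricted to img tags (pattern taken from another answer in this thread).
    public static String imgOnlyReplace(String html) {
        return Pattern.compile("(<img[^>]*\\ssrc=\")\\./", Pattern.CASE_INSENSITIVE)
                .matcher(html)
                .replaceAll("$1");
    }

    public static void main(String[] args) {
        String html = "<script src=\"./app.js\"></script><img src=\"./pic.jpg\" />";
        System.out.println(plainReplace(html));   // both the script and the img src change
        System.out.println(imgOnlyReplace(html)); // only the img src changes
    }
}
```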

Bohemian
  • I already had that idea, but there is no guarantee that the src attribute only occurs in IMG tags. In particular, the src attribute is valid for quite a lot of HTML tags, so that's a pretty unpredictable approach. – mrd Feb 14 '12 at 23:54