
I have a Java String variable containing HTML, and I want to replace the name of every PNG image in it with another name.

Example input HTML

<html>
  <head>
    <link rel="stylesheet" media="screen" href="style.css"/>
  </head>
  <body>
    <img href="test1.png" />
    <img href="test2.png" />
  </body>
</html>

Expected output HTML

<html>
  <head>
    <link rel="stylesheet" media="screen" href="style.css"/>
  </head>
  <body>
    <img href="C:\foo\bar\test1.png" />
    <img href="C:\foo\bar\test2.png" />
  </body>
</html>

Currently I have the Java code below, which obtains the new name by loading the image as a resource. However, I can't find the right regex to select all (and only) the image names (with extension but without quotes). Can anyone help me with that?

Pattern imagePattern = Pattern.compile(" TODO ");
Matcher imageMatcher = imagePattern.matcher(taskHTML);

while (imageMatcher.find())
{
    String oldName = imageMatcher.group(1);
    String newName = "" + getClass().getResource("/images/" + oldName);

    // Strings are immutable: replace() returns a new String, so keep the result
    taskHTML = taskHTML.replace(oldName, newName);
}

The matcher should list the following elements:

[test1.png, test2.png]
Spotted

4 Answers


As others have mentioned, I suggest using an HTML parser such as jsoup.

Usage:

import org.jsoup.nodes.*;
import org.jsoup.select.Elements;
import org.jsoup.Jsoup;

public class Parse {

    public static void main(String[] args) {
        String webPage = "<img href=\"test1.png\" /><img href=\"test2.png\" />"; //your HTML

        Document doc = Jsoup.parse(webPage);

        Elements imgLinks = doc.select("img[href]"); // select every <img> that has an href attribute

        // for every matching <img>
        for (Element link : imgLinks) {
            String imageName = link.attr("href"); // current href (your image name)
            link.attr("href", "C:\\foo\\bar\\" + imageName); // replace href with the dir + imageName
        }
        System.out.println(doc.html()); // print modified HTML
    }
}

Output (when webPage holds the full HTML from the question):

<html>
    <head>
        <link rel="stylesheet" media="screen" href="style.css">
    </head>
    <body>
        <img href="C:\foo\bar\test1.png"> 
        <img href="C:\foo\bar\test2.png">
    </body>
</html>

If you have a local HTML file that you want to parse, replace the doc line above with this:

File in = new File(input); // input: path to your HTML file
Document doc = Jsoup.parse(in, null); // null charset: detected from the document, falling back to UTF-8

Or, if you want to connect directly to a page, replace it with this:

Document doc = Jsoup.connect("http://stackoverflow.com/").get();

Note: You will need to add the jsoup JAR to your build path.

benscabbia
  • Useful but I ended up using a regexp (see my answer). – Spotted Mar 24 '15 at 14:39
  • @Spotted Glad you found a solution! Just a little pointer: using regex to parse HTML is not advised. For the reasons, read [this](http://stackoverflow.com/questions/6751105/why-its-not-possible-to-use-regex-to-parse-html-xml-a-formal-explanation-in-la). Anyway, don't forget to accept an answer to close the question :) – benscabbia Mar 24 '15 at 14:45

Try this:

str = str.replaceAll("href=\"(.*?)\"", "href=\"" + dir.replace("\\", "\\\\") + "$1\"");
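
Note that the pattern above matches every href attribute, so the stylesheet link from the question would get the prefix as well. Below is a minimal sketch of a variant restricted to values ending in .png. The class and method names (PngHrefPrefixer, prefixPngHrefs) are made up for illustration, and Matcher.quoteReplacement is swapped in for the manual backslash doubling so that a '$' in dir is also treated literally.

```java
import java.util.regex.Matcher;

public class PngHrefPrefixer {

    // Hypothetical helper: prefixes only href values ending in ".png";
    // every other href (e.g. style.css) is left untouched.
    public static String prefixPngHrefs(String html, String dir) {
        // [^\"]+ cannot run past the closing quote, so each attribute is matched on its own.
        String regex = "href=\"([^\"]+\\.png)\"";
        // quoteReplacement escapes '\' and '$' in dir so they are inserted literally.
        String replacement = "href=\"" + Matcher.quoteReplacement(dir) + "$1\"";
        return html.replaceAll(regex, replacement);
    }

    public static void main(String[] args) {
        String html = "<link rel=\"stylesheet\" href=\"style.css\"/>"
                + "<img href=\"test1.png\" /><img href=\"test2.png\" />";
        System.out.println(prefixPngHrefs(html, "C:\\foo\\bar\\"));
    }
}
```

This still treats the HTML as plain text, so it assumes double-quoted attribute values and a lowercase .png extension.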
Evgeniy Dorofeev

If you need to modify HTML content, consider using XSLT instead of regular expressions.

Igor Konoplyanko

I ended up using the following regular expression:

Pattern.compile("\\\"(.+\\.png)\\\"");

The text between the quotes is then available as capture group 1 of each match (group 0 is the whole match, including the quotes):

matcher.group(1);
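
For completeness, here is a hedged sketch of how such a pattern could drive the replacement loop from the question. It is not the exact code used: the class name PngNameRewriter and the resolver parameter are hypothetical (resolver stands in for the getResource lookup), and [^"]+ is used instead of .+ so that a match cannot stretch across several quoted attributes on the same line.

```java
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PngNameRewriter {

    // Variant of the pattern above: [^\"]+ stops at the first closing quote,
    // whereas .+ is greedy and may swallow neighbouring attributes.
    private static final Pattern PNG_NAME = Pattern.compile("\"([^\"]+\\.png)\"");

    // 'resolver' is a hypothetical stand-in for the resource lookup in the question.
    public static String rewrite(String html, Function<String, String> resolver) {
        Matcher m = PNG_NAME.matcher(html);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String oldName = m.group(1); // group(0) would include the surrounding quotes
            String newName = resolver.apply(oldName);
            // quoteReplacement keeps '\' and '$' in the new name literal
            m.appendReplacement(out, Matcher.quoteReplacement("\"" + newName + "\""));
        }
        m.appendTail(out); // copy whatever follows the last match
        return out.toString();
    }

    public static void main(String[] args) {
        String html = "<img class=\"icon\" href=\"test1.png\" /><img href=\"test2.png\" />";
        System.out.println(rewrite(html, name -> "C:\\foo\\bar\\" + name));
    }
}
```

Building the result with appendReplacement and appendTail replaces each match at its own position, which avoids calling String.replace on the whole document once per image.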
Spotted