Java replace content in a link

Question

I need to read the HTML of a web page, find the links and images in it, and rename them. This is what I have so far:

reader = new BufferedReader(new InputStreamReader(socket.getInputStream(), "UTF-8"));
String line;
while ((line = reader.readLine()) != null) {
    regex = "<a[^>]*href=(\"([^\"]*)\"|\'([^\']*)\'|([^\\s>]*))[^>]*>(.*?)</a>";
    final Pattern pa = Pattern.compile(regex, Pattern.DOTALL);
    final Matcher ma = pa.matcher(s);
    if (ma.find()) {
        String newlink = path + "1-2.html";
        //replace the link in href with newlink, how can i do this?
    }
    html.append(line).append("\r\n");
}

How can I do the part marked by the comment?

You DO NOT WANT (!!!!) to parse HTML with RegEx! Use some HTML/XML parser instead! — Dominik Sandjaja, Sep 26 '12 at 07:25
And here are the links to prove my point: http://stackoverflow.com/questions/8577060/why-is-it-such-a-bad-idea-to-parse-xml-with-regex http://www.codinghorror.com/blog/2009/11/parsing-html-the-cthulhu-way.html — Dominik Sandjaja, Sep 26 '12 at 07:26

score 0 · Answer 1 · answered Sep 26 '12 at 07:37

Using regex for parsing HTML can be difficult and unreliable. It's better to use XPath and DOM manipulation for things like that.
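
For someone new to Java, a minimal sketch of the DOM/XPath approach might look like the following. It uses only the JDK's built-in XML APIs, so it assumes the page is well-formed XHTML; real-world HTML often is not, in which case a dedicated HTML parser library would be needed instead. The class name DomLinkRewriter, the sample markup, and the target pages/1-2.html are made up for illustration.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Attr;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class, not part of the original question.
class DomLinkRewriter {

    /** Sets the href of every <a> element in a well-formed XHTML string to newTarget. */
    static String rewrite(String xhtml, String newTarget) throws Exception {
        // Parse the markup into a DOM tree (fails if the input is not well-formed XML).
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xhtml)));

        // Select the href attribute nodes of all <a> elements.
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList hrefs = (NodeList) xpath.evaluate("//a/@href", doc, XPathConstants.NODESET);
        for (int i = 0; i < hrefs.getLength(); i++) {
            ((Attr) hrefs.item(i)).setValue(newTarget);
        }

        // Serialize the modified tree back to a string.
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.METHOD, "xml");
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        transformer.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        String page = "<html><body><a href=\"old.html\">Link</a></body></html>";
        System.out.println(rewrite(page, "pages/1-2.html"));
    }
}
```

Images could be handled the same way with an expression such as //img/@src. The point of the design is that the parser, not a pattern, decides where an attribute value starts and ends.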

Adam Dyga

Not even! Do not reinvent the wheel. In almost every case there is an existing solution for a common problem, and this is no exception. – Luiggi Mendoza Sep 26 '12 at 07:48
How can I do that? I am new to Java. – onegun Sep 26 '12 at 08:00

Joop Eggen · Answer 2 · 2013-09-12T10:19:19.447

Alternatives were mentioned; nevertheless:

Matcher supports a "replace all" style loop that writes its output into a StringBuffer.
Parts of the matched text must be re-added in the replacement text, so everything you want to keep must sit in a capturing group: ma.group(1), (2), (3), ...
DOTALL lets . match newline characters; it is not needed here because readLine strips the line ending.
There can be more than one link per line.
The example code called matcher(s) instead of matcher(line).

So the code uses Matcher.appendReplacement and appendTail.

StringBuffer html = new StringBuffer();
reader = new BufferedReader(new InputStreamReader(socket.getInputStream(), "UTF-8"));
String line;
regex = "(<a[^>]*href=)(\"([^\"]*)\"|\'([^\']*)\'|([^\\s>]*))[^>]*>(.*?)(</a>)";
final Pattern pa = Pattern.compile(regex);
while ((line = reader.readLine()) != null) {
    final Matcher ma = pa.matcher(line);
    while (ma.find()) {
        String newlink = path + "1-2.html";
        // Rebuild the match from its groups, putting newlink in place of the old href value.
        ma.appendReplacement(html, ma.group(1) /* a href */ + ...);
    }
    ma.appendTail(html); // copies the rest of the line after the last match
    html.append("\r\n"); // the line itself is already in html; only restore the line break
}
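
To make the appendReplacement/appendTail loop concrete, here is a self-contained sketch under a few assumptions: it reads from a BufferedReader over a sample string rather than a socket, and it uses a variation of the pattern above in which everything that has to be kept sits in a group (group 1 is the tag up to href=, group 2 is the old value, group 3 is the rest of the tag), so only the href value changes. The class name HrefRewriter, the sample markup, and the target path/1-2.html are made up for illustration. Matcher.quoteReplacement is used because appendReplacement treats $ and \ in the replacement string specially.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the original answer.
class HrefRewriter {

    // Group 1: "<a ... href=", group 2: the old value (double-quoted, single-quoted or bare),
    // group 3: any remaining attributes plus the closing ">".
    private static final Pattern HREF = Pattern.compile(
            "(<a\\b[^>]*?\\shref\\s*=\\s*)(\"[^\"]*\"|'[^']*'|[^\\s>]+)([^>]*>)",
            Pattern.CASE_INSENSITIVE);

    /** Replaces the href value of every <a> tag, line by line, with newLink. */
    static String rewrite(BufferedReader reader, String newLink) throws IOException {
        StringBuffer html = new StringBuffer();
        String line;
        while ((line = reader.readLine()) != null) {
            Matcher ma = HREF.matcher(line);
            while (ma.find()) { // a line may contain several links
                String replacement = ma.group(1) + '"' + newLink + '"' + ma.group(3);
                // Copies the text before the match, then the literal replacement.
                ma.appendReplacement(html, Matcher.quoteReplacement(replacement));
            }
            ma.appendTail(html);  // copies whatever follows the last match
            html.append("\r\n"); // readLine stripped the line ending
        }
        return html.toString();
    }

    public static void main(String[] args) throws IOException {
        String page = "<p><a href=\"a.html\">A</a> and <a class='x' href='b.html' id=\"b\">B</a></p>\n"
                + "no links here";
        System.out.print(rewrite(new BufferedReader(new StringReader(page)), "path/1-2.html"));
    }
}
```

This still inherits the usual weaknesses of matching HTML with a regex: a tag split across lines, or a > inside an attribute value, will not be handled.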
