I need to read the HTML of a webpage, find the links and images in it, and then rename those links and images. This is what I have so far:

reader = new BufferedReader(new InputStreamReader(socket.getInputStream(), "UTF-8"));
String line;
while ((line = reader.readLine()) != null) {
    regex = "<a[^>]*href=(\"([^\"]*)\"|\'([^\']*)\'|([^\\s>]*))[^>]*>(.*?)</a>";
    final Pattern pa = Pattern.compile(regex, Pattern.DOTALL);
    final Matcher ma = pa.matcher(s);
    if (ma.find()) {
        String newlink = path + "1-2.html";
        // replace the link in href with newlink, how can I do this?
    }
    html.append(line).append("\r\n");
}

How can I do the part marked by the comment?

onegun

2 Answers

Using regex for parsing HTML can be difficult and unreliable. It's better to use XPath and DOM manipulation for things like that.
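
As a rough illustration of the DOM route, here is a minimal sketch that uses only the JDK's built-in XML parser and XPath. It assumes the page is well-formed XHTML with no DOCTYPE and no HTML-only entities such as &nbsp;; ordinary real-world HTML usually is not, and would need a lenient HTML parser instead. The class name DomLinkRewriter, the method names, and the newLink/newImage parameters are made up for this example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of any library.
public class DomLinkRewriter {

    // Parses well-formed XHTML and points every <a href> and <img src>
    // at the given replacement values.
    static Document rewrite(String xhtml, String newLink, String newImage) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            setAttribute(doc, "//a[@href]", "href", newLink);
            setAttribute(doc, "//img[@src]", "src", newImage);
            return doc;
        } catch (Exception e) {
            // Parsing fails on anything that is not well-formed XML.
            throw new IllegalStateException("Could not parse input as XHTML", e);
        }
    }

    // Sets one attribute on every element selected by the XPath expression.
    private static void setAttribute(Document doc, String xpath,
                                     String name, String value) throws Exception {
        NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate(xpath, doc, XPathConstants.NODESET);
        for (int i = 0; i < nodes.getLength(); i++) {
            ((Element) nodes.item(i)).setAttribute(name, value);
        }
    }
}
```

Because the parser understands the markup, links and images spread over several lines or using different quoting are all handled the same way.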

Adam Dyga

Alternatives have been mentioned; nevertheless, if you stay with regex:

  • Matcher supports a "replace all" style loop that writes its output into a StringBuffer.
  • Parts of the matched text must be re-added as replacement text, so everything you want to keep has to be captured in a group: ma.group(1), (2), (3), ...
  • DOTALL lets . match newline characters; it is not needed here, because readLine strips the line ending.
  • There can be more than one link per line.
  • Your example code had matcher(s) instead of matcher(line).

So the code uses Matcher.appendReplacement and appendTail.

StringBuffer html = new StringBuffer();
reader = new BufferedReader(new InputStreamReader(socket.getInputStream(), "UTF-8"));
String line;
regex = "(<a[^>]*href=)(\"([^\"]*)\"|\'([^\']*)\'|([^\\s>]*))[^>]*>(.*?)(</a>)";
final Pattern pa = Pattern.compile(regex); // compile once, outside the loop
while ((line = reader.readLine()) != null) {
    final Matcher ma = pa.matcher(line);
    while (ma.find()) { // there may be several links per line
        String newlink = path + "1-2.html";
        // copies the text before the match, then the replacement
        ma.appendReplacement(html, ma.group(1) /* a href */ + ...);
    }
    ma.appendTail(html); // copies the rest of the line after the last match
    html.append("\r\n"); // the line itself was already copied above
}
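
To make the appendReplacement / appendTail loop concrete, here is a self-contained sketch that works on a single line. It is one possible way to fill in the replacement, not necessarily the exact one intended above: the pattern is simplified to three groups (everything up to href=, the href value, the rest of the element), and the class name LinkRewriter, the method rewriteLinks and the newLink parameter are invented for this example. Matcher.quoteReplacement is used so that a $ or \ in the replacement text is taken literally.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
public class LinkRewriter {

    // Group 1: "<a ... href="
    // Group 2: the href value, quoted or unquoted
    // Group 3: the rest of the element, up to and including "</a>"
    private static final Pattern LINK = Pattern.compile(
            "(<a\\b[^>]*?\\bhref=)(\"[^\"]*\"|'[^']*'|[^\\s>]*)([^>]*>.*?</a>)");

    // Returns the line with the href of every link replaced by newLink.
    static String rewriteLinks(String line, String newLink) {
        Matcher ma = LINK.matcher(line);
        StringBuffer out = new StringBuffer();
        while (ma.find()) {
            // Re-add the parts to keep (groups 1 and 3) around the new value.
            String replacement = ma.group(1) + "\"" + newLink + "\"" + ma.group(3);
            ma.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        ma.appendTail(out); // text after the last match, or the whole line if none
        return out.toString();
    }

    public static void main(String[] args) {
        String line = "<p><a href=\"old.html\">one</a> and <a class='x' href='b.html'>two</a></p>";
        System.out.println(rewriteLinks(line, "pages/1-2.html"));
    }
}
```

In the reading loop you would then call html.append(rewriteLinks(line, newlink)).append("\r\n") for each line. Links whose tag is split across several lines are not matched by this line-by-line approach.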
Joop Eggen