
I'm trying to parse an XML file with Java. Before I start parsing, I need to replace (encode) some text between the <code> and </code> tags.

Therefore I read the contents of the file into a String:

File xml = new File(this.xmlFileName);
final BufferedReader reader = new BufferedReader(new FileReader(xml));
final StringBuilder contents = new StringBuilder();
while (reader.ready()) {
    contents.append(reader.readLine());
}
reader.close();
final String stringContents = contents.toString();

After reading the XML into the string, I encode the values using Pattern and Matcher:

StringBuffer sb = new StringBuffer();
Pattern p = Pattern.compile("<code>(.*?)</code>", Pattern.DOTALL);
Matcher m = p.matcher(stringContents);
while (m.find()) {
    //Encode text between <code> and </code> tags
    String valueFromTags = m.group(1);
    byte[] decodedBytes = valueFromTags.getBytes();
    new Base64();
    String encodedBytes = Base64.encodeBase64String(decodedBytes);
    m.appendReplacement(sb, "<code>" + encodedBytes + "</code>");
}
m.appendTail(sb);
String result = sb.toString();

After the replacements are done, I try to read this String into the XML parser:

DocumentBuilderFactory dbFactory = DocumentBuilderFactory
        .newInstance();
DocumentBuilder dBuilder = dbFactory.newDocumentBuilder();
Document doc = dBuilder.parse(result);
doc.getDocumentElement().normalize();

But then I get this error: java.net.MalformedURLException: no protocol: <root> <application> <interface>...

As you can see, after I read the file into a String, for some reason a lot of spaces are added where there were newlines or tabs in the original file. I think that's why I get this error. Is there any way I can solve this?

Kaj
  • [Obligatory link](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags). I think this is a prime example of why you should **never** do this. – Boris the Spider May 15 '14 at 23:21
  • What would be the right way to encode text between the <code> and </code> tags then? I can't parse it before encoding it, because it contains special characters like < and > and the parser will give errors because of that. But note that the parser's failure to parse the XML in my example has something to do with the way I read it into a String using BufferedReader. The spaces are already there before the regex replacement. – Kaj May 15 '14 at 23:25
  • Well, you don't have valid XML then. Find some valid XML. – Boris the Spider May 15 '14 at 23:26
  • If I let the parser read the File object instead of the String read in by BufferedReader, the parser works and there aren't any errors. So the XML is valid. But in order to do the replacements I have to read it into a String first, and that's where it goes wrong. – Kaj May 15 '14 at 23:29

1 Answer


I think you still need to check that readLine has not returned a null.

String line;
while ((line = reader.readLine()) != null) {
    contents.append(line);
}
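
It may also be worth noting that DocumentBuilder.parse(String) interprets its argument as a URI rather than as XML text, which would explain the MalformedURLException in the question. A minimal sketch of the read loop together with parsing from an in-memory string follows. The class and method names (XmlFromString, readAll, parseXml) are made up for illustration, the sample XML is a placeholder, and a Reader parameter stands in for the FileReader so the example runs on its own:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper class, not part of the original question or answer.
class XmlFromString {

    // Reads every line until readLine() returns null, and puts back the
    // line break that readLine() strips from each line.
    static String readAll(Reader source) throws IOException {
        StringBuilder contents = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                contents.append(line).append('\n');
            }
        }
        return contents.toString();
    }

    // Parses XML held in a String. parse(String) would treat the argument
    // as a URI, so the text is wrapped in an InputSource instead.
    static Document parseXml(String xml) throws Exception {
        DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
        return builder.parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        // Placeholder input; in the question this would be a FileReader.
        String xml = readAll(new StringReader("<root>\n  <code>YWJj</code>\n</root>"));
        Document doc = parseXml(xml);
        doc.getDocumentElement().normalize();
        System.out.println(doc.getDocumentElement().getNodeName()); // prints "root"
    }
}
```

In the question's code, the equivalent change would be to pass new InputSource(new StringReader(result)) to dBuilder.parse instead of result itself.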
sanz