I'm new to jsoup and I'm having some difficulty working with non-HTML elements (scripts). I have the following HTML:
<$if not dcSnippet$>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="generator" content="Outside In HTML Converter version 8.4.0"/>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
<title></title>
</head>
<$endif$>
<div style="position:relative">
<p style="text-align: left; font-family: times; font-size: 10pt; font-weight: normal; font-style: normal; text-decoration: none"><span style="font-weight: normal; font-style: normal">This is a test document.</span></p>
</div>
<$if not dcSnippet$>
</body>
</html>
<$endif$>
The application that displays this page knows how to handle the <$if not dcSnippet$> and <$endif$> statements. When I simply parse the text with jsoup, however, the < and > are encoded as &lt; and &gt; and the HTML is reorganized, so it no longer executes or displays properly. Like so:
<html xmlns="http://www.w3.org/1999/xhtml"><head></head><body>&lt;$if not dcSnippet$&gt;
<meta http-equiv="generator" content="Outside In HTML Converter version 8.4.0">
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<title></title>
&lt;$endif$&gt;
<div style="position:relative">
<p style="text-align: left; font-family: times; font-size: 10pt; font-weight: normal; font-style: normal; text-decoration: none"><span style="font-weight: normal; font-style: normal">This is a test document.</span></p>
</div>
&lt;$if not dcSnippet$&gt;
&lt;$endif$&gt;
</body></html>
My end goal is to add some CSS and JS includes and modify a couple of element attributes. That part isn't a problem; I have it worked out. The problem is that I don't know how to preserve the non-HTML elements and keep them in the same place as in the original. My solution so far goes like this:
- Read in the HTML file and iterate through it, removing the lines with the non-HTML elements.
- Create a Document object from the pure HTML.
- Make my modifications.
- Go back through the HTML and re-insert the non-HTML elements (scripts) that I removed first.
- Save the document out to the filesystem.
This works for now, as long as the placement of the non-HTML elements is predictable, and so far it is. But I'd like to know whether there's a better way to do this, so that I don't have to 'clean' the HTML first and then manually re-introduce what I removed. Here's the gist of my code (hopefully I didn't miss too many declarations):
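// Needs java.io.* plus org.jsoup.Jsoup, org.jsoup.nodes.Document,
// org.jsoup.nodes.Element and org.jsoup.nodes.Entities.EscapeMode.
String thisLine;
String ifStatement = "";
String endifStatement = "";
StringBuilder tempHtml = new StringBuilder();
// Separate the directives from the plain HTML (only the last directive of each kind is kept).
BufferedReader br = new BufferedReader(new FileReader(inputFile));
while ((thisLine = br.readLine()) != null) {
    if (thisLine.matches(".*<\\$if.*\\$>")) {
        ifStatement = thisLine + "\n";
    } else if (thisLine.matches(".*<\\$endif\\$>")) {
        endifStatement = thisLine + "\n";
    } else {
        tempHtml.append(thisLine).append("\n");
    }
}
br.close();
// The second argument of Jsoup.parse(String, String) is a base URI, not a charset, so it is omitted.
Document doc = Jsoup.parse(tempHtml.toString());
doc.outputSettings().prettyPrint(false).escapeMode(EscapeMode.extended);
Element head = doc.head();
Element body = doc.body();
Element firstDiv = body.select("div").first();
[... perform my element and attribute inserts ...]
// Close the first <$if ...$> block at the start of the body and open the second one at its end.
body.prependText("\n" + endifStatement);
body.appendText("\n" + ifStatement);
// jsoup escapes the directives' angle brackets on output, so undo that before writing.
String fullHtml = ifStatement + doc.toString().replace("&lt;", "<").replace("&gt;", ">") + "\n" + endifStatement;
BufferedWriter htmlWriter = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(outputFile), "UTF-8"));
htmlWriter.write(fullHtml);
htmlWriter.flush();
htmlWriter.close();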
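To make the 'clean, then re-insert' idea less dependent on where the directives sit, one possible variation is to swap each directive for a numbered HTML comment before parsing and swap it back afterwards. The following is only a minimal sketch using the standard library: the class name DirectivePlaceholders, the methods mask and unmask, and the <!--DIRECTIVE:n--> placeholder format are all hypothetical, and it assumes the parser keeps comments in the output, which I have not verified for directives that sit outside <html> or after </body>.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of jsoup.
class DirectivePlaceholders {
    // Matches single-line directives such as <$if not dcSnippet$> and <$endif$>.
    private static final Pattern DIRECTIVE = Pattern.compile("<\\$.*?\\$>");
    // Matches the placeholder comments written by mask().
    private static final Pattern PLACEHOLDER = Pattern.compile("<!--DIRECTIVE:(\\d+)-->");

    // Replaces each directive with a numbered comment and records the original text in 'saved'.
    static String mask(String html, List<String> saved) {
        Matcher m = DIRECTIVE.matcher(html);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            saved.add(m.group());
            m.appendReplacement(out, "<!--DIRECTIVE:" + (saved.size() - 1) + "-->");
        }
        m.appendTail(out);
        return out.toString();
    }

    // Puts the recorded directives back wherever their placeholder comments ended up.
    static String unmask(String html, List<String> saved) {
        Matcher m = PLACEHOLDER.matcher(html);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String original = saved.get(Integer.parseInt(m.group(1)));
            // quoteReplacement is needed because the directives contain '$'.
            m.appendReplacement(out, Matcher.quoteReplacement(original));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String input = "<$if not dcSnippet$>\n<div>text</div>\n<$endif$>";
        List<String> saved = new ArrayList<>();
        String masked = mask(input, saved);
        // ... parse 'masked' with jsoup, modify it, and serialize it back to a String here ...
        String restored = unmask(masked, saved);
        System.out.println(masked);
        System.out.println(restored);
    }
}
```

With this, the entity replacement on the serialized document would no longer be needed for the directives, but whether the comments stay in exactly the original positions still depends on how the parser restructures the document.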
Thanks so much for any help or input!