I need to parse an HTML page and display some search results from it. I have seen code that does this kind of parsing, but all of it works on XML files. I tried converting the HTML file to XML so that I could parse it, but that didn't work. My guess is that the conversion fails because the page contains some JavaScript. I Googled how to remove JavaScript from HTML files, but the results were mostly about security and I couldn't tell what I should actually do. I also searched similar questions here; they mention JTidy and DeXSS, but I don't understand how these would help me remove the scripts from the HTML page so that I can convert it to XML.
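To make my guess concrete, here is a minimal sketch (not my real code) that feeds small markup strings to the strict XML parser built into the JDK and reports whether they are accepted. The class name `WellFormedCheck`, the method `isWellFormedXml`, and the sample strings are made up for illustration; I am assuming the parser behind JDOM behaves the same way on this kind of input.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper: checks whether a markup string is well-formed XML.
class WellFormedCheck {

    static boolean isWellFormedXml(String markup) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // parse() throws SAXException as soon as the input breaks XML rules
            builder.parse(new InputSource(new StringReader(markup)));
            return true;
        } catch (SAXException e) {
            return false; // not well-formed XML
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Every tag closed and nested correctly
        String xhtml = "<html><body><p>ok</p></body></html>";
        // Typical HTML: <br> is never closed
        String unclosedTag = "<html><body><br></body></html>";
        // Inline script with raw '<' and '&' characters
        String inlineScript =
                "<html><head><script>if (a < b && c) { }</script></head></html>";

        System.out.println(isWellFormedXml(xhtml));
        System.out.println(isWellFormedXml(unclosedTag));
        System.out.println(isWellFormedXml(inlineScript));
    }
}
```

If this reasoning is right, the script is only one of several things that can break the conversion: any unclosed tag in ordinary HTML would be rejected by a strict XML parser as well.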
The code I'm using to convert the HTML to XML is this:
// SAXBuilder, Document, XMLOutputter and JDOMException come from JDOM; the rest from java.io
FileWriter fwOutXml = null;
FileReader frInHtml = null;
BufferedWriter bwOutXml = null;
BufferedReader brInHtml = null;
try {
    frInHtml = new FileReader("./Lib.html");
    brInHtml = new BufferedReader(frInHtml);
    // SAXBuilder is a strict XML parser: build() throws JDOMException
    // when the input is not well-formed XML
    SAXBuilder saxBuilder = new SAXBuilder();
    Document jdomDocument = saxBuilder.build(brInHtml);
    XMLOutputter outputter = new XMLOutputter();
    // Write the document to the console and to ./Lib.xml
    outputter.output(jdomDocument, System.out);
    fwOutXml = new FileWriter("./Lib.xml");
    bwOutXml = new BufferedWriter(fwOutXml);
    outputter.output(jdomDocument, bwOutXml);
}
// Print failures instead of swallowing them silently
catch (JDOMException e) { e.printStackTrace(); }
catch (IOException e) { e.printStackTrace(); }
finally {
    System.out.flush();
    try {
        // Close only what was actually opened; closing a BufferedWriter or
        // BufferedReader flushes it and closes the wrapped stream as well
        if (bwOutXml != null) bwOutXml.close();
        else if (fwOutXml != null) fwOutXml.close();
        if (brInHtml != null) brInHtml.close();
        else if (frInHtml != null) frInHtml.close();
    }
    catch (Exception w) { w.printStackTrace(); }