I need to parse an HTML page and display some search results from it. All the parsing code I have seen works on XML files, so I tried to convert the HTML file to XML first, but that didn't work. My guess is that the conversion fails because the page contains some JavaScript. I Googled how to remove JavaScript from HTML files, but the results were mostly about security and I didn't understand what I should do. I also searched similar questions here; they mention JTidy and DeXSS, but I don't see how these would help me remove the scripts from the HTML page so that I can convert it to XML.
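
To make clear what I mean by removing the scripts, here is a minimal sketch that uses only `java.util.regex`. The class name `ScriptStripper` and the method `stripScripts` are names made up for illustration; they are not part of JTidy or DeXSS. As far as I can tell, a regex like this only drops `<script>` blocks. It does not repair unclosed tags or undeclared entities such as `&nbsp;`, so the result may still not be well-formed XML.

```java
import java.util.regex.Pattern;

public class ScriptStripper {
    // Matches <script ...> ... </script>, case-insensitively and across line breaks.
    private static final Pattern SCRIPT_BLOCK = Pattern.compile(
            "<script\\b[^>]*>.*?</script\\s*>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Hypothetical helper: returns the input with every script block removed.
    public static String stripScripts(String html) {
        return SCRIPT_BLOCK.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        String html = "<html><head><script type=\"text/javascript\">if (a < b) { go(); }</script></head>"
                + "<body><p>Result 1</p></body></html>";
        System.out.println(stripScripts(html));
    }
}
```

The pattern is lazy (`.*?`), so each script block is removed separately and any markup between two blocks is kept.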

The code I'm using to convert the HTML to XML is this:

// Assumes JDOM 1.x; with JDOM 2 the packages are org.jdom2.* instead.
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import org.jdom.Document;
import org.jdom.JDOMException;
import org.jdom.input.SAXBuilder;
import org.jdom.output.XMLOutputter;

// try-with-resources closes the reader even if parsing fails
try (BufferedReader brInHtml = new BufferedReader(new FileReader("./Lib.html"))) {
    SAXBuilder saxBuilder = new SAXBuilder();
    // SAXBuilder expects well-formed XML; build() throws JDOMException otherwise
    Document jdomDocument = saxBuilder.build(brInHtml);
    XMLOutputter outputter = new XMLOutputter();

    // Echo the parsed document to the console
    outputter.output(jdomDocument, System.out);
    System.out.flush();

    // Write the same document to ./Lib.xml; the writer is closed automatically
    try (BufferedWriter bwOutXml = new BufferedWriter(new FileWriter("./Lib.xml"))) {
        outputter.output(jdomDocument, bwOutXml);
    }
}
catch (JDOMException e) {
    // Report parse errors instead of silently swallowing them
    e.printStackTrace();
}
catch (IOException e) {
    e.printStackTrace();
}
pbaris
Alaa
