1

I have a zip file that contains Index.htm. I need to read the content of Index.htm, find the date inside it (December 2011), create a directory named after that date, and then extract the zip file into that directory.

This is the HTML file:

<HTML>    
  <HEAD></HEAD>    
  <BODY>    
  <A Name="TopOfPage"></A>    
  <TABLE Width="100%" Border="0" CellPadding="0" CellSpacing="0">    
   <TR> 
     <TD Width="30%"><A HRef="HeaderTxt/HetBCFI.htm">Het B.C.F.I.</A></TD>    
   </TR>      
  </TABLE>    
  <TABLE Width="100%" Border="0" CellPadding="0" CellSpacing="0">
   <TR> 
    <TD RowSpan="2" Width="10"></TD>
    <TD Width="70%"><STRONG><FONT Face="Arial" Size="2">Gecommentarieerd   Geneesmiddelenrepertorium</FONT></STRONG></TD> 
    <TD Width="29%" Align="Right" Class= "Datum">&nbsp;
   December 2011&nbsp;&nbsp;
  </TD>
  <TD Rowspan="2" Width="10"></TD>
 </TR>
</TABLE> </BODY> </HTML>
michdraft

3 Answers

3

Try this:

  1. Use the java.util.zip package to read the HTML file from the archive.
  2. Use an HTML parser (I would recommend jsoup) to get the date string. Here is a link that would help in your case.

Once you have the date string, create the directory you want.
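For the last step, here is a minimal sketch of creating the directory and unpacking the archive into it, using only the standard library. The class name `DatedExtractor` and the method `extractInto` are made up for illustration, and the sketch assumes the date string is already a valid directory name on your file system.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Enumeration;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Hypothetical helper, not part of any library.
class DatedExtractor {
    /**
     * Creates parent/dateString and unpacks the archive into it.
     * Returns the directory that was created.
     */
    static Path extractInto(Path archive, Path parent, String dateString) throws IOException {
        Path target = parent.resolve(dateString).normalize();
        Files.createDirectories(target);
        try (ZipFile zip = new ZipFile(archive.toFile())) {
            Enumeration<? extends ZipEntry> entries = zip.entries();
            while (entries.hasMoreElements()) {
                ZipEntry entry = entries.nextElement();
                Path out = target.resolve(entry.getName()).normalize();
                // Refuse entries such as "../x" that would land outside the target directory.
                if (!out.startsWith(target)) {
                    throw new IOException("Entry escapes target directory: " + entry.getName());
                }
                if (entry.isDirectory()) {
                    Files.createDirectories(out);
                } else {
                    // Some archives list files without listing their parent folders first.
                    Files.createDirectories(out.getParent());
                    try (InputStream in = zip.getInputStream(entry)) {
                        Files.copy(in, out, StandardCopyOption.REPLACE_EXISTING);
                    }
                }
            }
        }
        return target;
    }
}
```

You would call it with something like `DatedExtractor.extractInto(zipPath, outputRoot, "December 2011")`, where `zipPath` and `outputRoot` are whatever paths apply in your setup.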

EDIT: To remove &nbsp;, you could use one of the following:

  • Create another document element from the string containing &nbsp; and do the following:

    document.select(":containsOwn(\u00a0)").remove(); (taken from here)

  • Use the following (assuming the string to be cleaned is htmlString):

    Jsoup.parse(htmlString).text();

  • Use String's replaceAll() function to get rid of &nbsp;.
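The string-replacement option might look roughly like this. It is only a sketch: `NbspCleaner` and `clean` are names invented here, and it assumes you want both the literal `&nbsp;` entity and the decoded U+00A0 character turned into ordinary spaces before trimming.

```java
// Hypothetical helper, not part of jsoup or the JDK.
class NbspCleaner {
    /**
     * Replaces the literal entity and the decoded non-breaking space with
     * plain spaces, collapses runs of whitespace, and trims the result.
     */
    static String clean(String raw) {
        return raw.replace("&nbsp;", " ")   // entity as it appears in raw HTML
                  .replace('\u00a0', ' ')   // character produced after HTML decoding
                  .replaceAll("\\s+", " ")  // collapse newlines and repeated spaces
                  .trim();
    }
}
```

The explicit U+00A0 replacement is there because `String.trim()` only strips characters up to U+0020, so a non-breaking space at either end would otherwise survive.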

Santosh
2

Several steps:

  1. Use the java.util.zip package to open a decompressed stream for the entry.
  2. Use an HTML parser (like jsoup) to walk the nodes, and...
  3. Use a regex, or a regex combined with a date parser (such as SimpleDateFormat), to pick out the date.

This assumes that the date you're looking for is always in a text node.
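As a rough sketch of step 3, the following picks a "Month yyyy" string out of some text with a regex and then validates it with SimpleDateFormat. The names `DateFinder`, `findMonthYear`, and `parseMonthYear` are invented for this example, and it assumes English month names, as in the "December 2011" from the question.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
class DateFinder {
    // An English month name followed by a four-digit year, e.g. "December 2011".
    private static final Pattern MONTH_YEAR = Pattern.compile(
        "\\b(January|February|March|April|May|June|July|August|September"
        + "|October|November|December)\\s+(\\d{4})\\b");

    /** Returns the first "Month yyyy" match in the text, or null if there is none. */
    static String findMonthYear(String text) {
        Matcher m = MONTH_YEAR.matcher(text);
        return m.find() ? m.group(1) + " " + m.group(2) : null;
    }

    /** Parses a "Month yyyy" string; the resulting Date falls on the first day of that month. */
    static Date parseMonthYear(String monthYear) throws ParseException {
        SimpleDateFormat format = new SimpleDateFormat("MMMM yyyy", Locale.ENGLISH);
        format.setLenient(false);
        return format.parse(monthYear);
    }
}
```

Running the regex over the whole document text is cruder than selecting the right node first, so it is probably best used on the text of the element you already narrowed down with the parser.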

1

This is the final code that I used. Thanks to all of you for the useful tips.

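import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.ZipFile;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public static String getDateWithinHtmlInsideZipFile(File archive) throws IOException {
    // try-with-resources closes both the archive and the entry stream
    try (ZipFile zp = new ZipFile(archive);
         InputStream in = zp.getInputStream(zp.getEntry("Index.htm"))) {
        Document doc = Jsoup.parse(in, "UTF-8", "");
        // the date sits in the cell with Class="Datum"
        return doc.body().getElementsByClass("Datum").text().trim();
    }
}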
michdraft