How to read an htm file inside a zip file?

Question

I have a zip file that contains Index.htm. I need to read the content of Index.htm, find a date in it (December 2011), create a directory named after that date, and then extract the zip file into that directory.

This is the HTML file:

<HTML>    
  <HEAD></HEAD>    
  <BODY>    
  <A Name="TopOfPage"></A>    
  <TABLE Width="100%" Border="0" CellPadding="0" CellSpacing="0">    
   <TR> 
     <TD Width="30%"><A HRef="HeaderTxt/HetBCFI.htm">Het B.C.F.I.</A></TD>    
   </TR>      
  </TABLE>    
  <TABLE Width="100%" Border="0" CellPadding="0" CellSpacing="0">
   <TR> 
    <TD RowSpan="2" Width="10"></TD>
    <TD Width="70%"><STRONG><FONT Face="Arial" Size="2">Gecommentarieerd   Geneesmiddelenrepertorium</FONT></STRONG></TD> 
    <TD Width="29%" Align="Right" Class= "Datum">&nbsp;
   December 2011&nbsp;&nbsp;
  </TD>
  <TD Rowspan="2" Width="10"></TD>
 </TR>
</TABLE> </BODY> </HTML>

score 3 · Answer 1 · edited May 23 '17 at 12:12

Try this:

1. Use the java.util.zip package to read the HTML file.
2. Use an HTML parser (I would recommend JSoup) to get the date string. Here is a link that would help in your case.
3. Once you have the date string, create the directory you want.
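
As a rough sketch of the first step, assuming Java 9 or later (for `InputStream.readAllBytes`) and a UTF-8 encoded page, the entry can be read into a String with java.util.zip alone. The class name `ZipEntryReader` and the method `readEntry` are made up for illustration; the resulting string could then be handed to `Jsoup.parse(String)`.

```java
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

public class ZipEntryReader {

    /** Returns the text of one entry in the archive, decoded as UTF-8 (an assumption). */
    public static String readEntry(File archive, String entryName) throws IOException {
        // try-with-resources closes the archive even if reading fails
        try (ZipFile zip = new ZipFile(archive)) {
            ZipEntry entry = zip.getEntry(entryName);
            if (entry == null) {
                throw new IOException(entryName + " not found in " + archive);
            }
            try (InputStream in = zip.getInputStream(entry)) {
                return new String(in.readAllBytes(), StandardCharsets.UTF_8);
            }
        }
    }
}
```

Usage would look like `String html = ZipEntryReader.readEntry(new File("xxx.zip"), "Index.htm");`.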

EDIT: To remove &nbsp;, you could do one of the following:

1. Create another document element from the string containing &nbsp; and do the following:

   document.select(":containsOwn(\u00a0)").remove(); (taken from here)

2. Use the following (assuming the string to be cleaned is htmlString):

   Jsoup.parse(htmlString).text();

3. Use String's replaceAll() function to get rid of &nbsp;.
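
The string-replacement option can be sketched in plain Java. This assumes the value may contain either the literal entity `&nbsp;` (as returned by `html()`) or the decoded character U+00A0; `NbspCleaner` and `clean` are hypothetical names. Both forms are turned into ordinary spaces first, because `String.trim()` does not strip U+00A0.

```java
public class NbspCleaner {

    /** Replaces "&nbsp;" entities and U+00A0 characters with spaces, then trims. */
    public static String clean(String raw) {
        return raw.replace("&nbsp;", " ")   // literal entity text
                  .replace('\u00a0', ' ')   // decoded non-breaking space
                  .trim();
    }
}
```

`replace()` is used here rather than `replaceAll()` because no regular expression is needed.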

answered Jan 16 '12 at 15:18

Santosh

When I parse the htm file I get &nbsp; before and after my string. How can I get rid of it? – michdraft Jan 18 '12 at 12:04
Updated my answer to address your concerns. – Santosh Jan 18 '12 at 12:44
`String date = doc.body().getElementsByClass("Datum").html().toString().replaceAll("&nbsp;","").trim();` – michdraft Jan 18 '12 at 14:14

score 2 · Answer 2 · answered Jan 16 '12 at 15:17

Several steps:

1. Use the java.util.zip package and create a decompressed stream.
2. Use an HTML parser (like JSoup) to walk the nodes, and...
3. Use a regex, or a regex with a date parser (such as SimpleDateFormat), to pick out the date.

This assumes that the date you're looking for is always in a text node.
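
A minimal sketch of the last step, assuming the date always appears as an English month name followed by a four-digit year (as in "December 2011"). `DateFinder`, `findMonthYear` and `parse` are illustrative names, not part of any library.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DateFinder {

    // English month name followed by a four-digit year, e.g. "December 2011".
    private static final Pattern MONTH_YEAR = Pattern.compile(
        "\\b(January|February|March|April|May|June|July|August|September"
        + "|October|November|December)\\s+(\\d{4})\\b");

    /** Returns the first "Month yyyy" match in the text, or null if there is none. */
    public static String findMonthYear(String text) {
        Matcher m = MONTH_YEAR.matcher(text);
        return m.find() ? m.group(1) + " " + m.group(2) : null;
    }

    /** Parses a "Month yyyy" string; the result falls on the first day of that month. */
    public static Date parse(String monthYear) throws ParseException {
        return new SimpleDateFormat("MMMM yyyy", Locale.ENGLISH).parse(monthYear);
    }
}
```

Because the regex only looks for the month and year, surrounding whitespace or &nbsp; in the text node does not need to be cleaned first.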

Extra step 1.5: `ZipFile zp = new ZipFile("xxx.zip"); InputStream in = zp.getInputStream(zp.getEntry("Index.htm"));` – Andrei LED Jan 16 '12 at 15:21
I get this result: `&nbsp;December 2011&nbsp;&nbsp;` How can I omit &nbsp; from my string? – michdraft Jan 18 '12 at 12:03

score 1 · Accepted Answer · answered Jan 18 '12 at 10:43

This is the final code that I used. Thanks to all of you for providing useful tips.

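import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.ZipFile;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public static String getDateWithinHtmlInsideZipFile(File archive) throws IOException {
    // try-with-resources closes both the archive and the entry stream
    try (ZipFile zp = new ZipFile(archive);
         InputStream in = zp.getInputStream(zp.getEntry("Index.htm"))) {
        Document doc = Jsoup.parse(in, "UTF-8", "");
        return doc.body().getElementsByClass("Datum").text().trim();
    }
}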
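
For the remaining part of the question (creating the directory and extracting the archive into it), something along these lines might work. This is only a sketch using java.util.zip and java.nio.file: `ZipExtractor` and `extractInto` are hypothetical names, and the directory name is assumed to be the date string returned by the method above.

```java
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Enumeration;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

public class ZipExtractor {

    /** Extracts every entry of the archive into parent/dirName and returns that directory. */
    public static Path extractInto(File archive, Path parent, String dirName) throws IOException {
        Path target = parent.resolve(dirName).toAbsolutePath().normalize();
        Files.createDirectories(target);
        try (ZipFile zip = new ZipFile(archive)) {
            Enumeration<? extends ZipEntry> entries = zip.entries();
            while (entries.hasMoreElements()) {
                ZipEntry entry = entries.nextElement();
                Path out = target.resolve(entry.getName()).normalize();
                // Refuse entries such as "../x" that would land outside the target directory.
                if (!out.startsWith(target)) {
                    throw new IOException("Entry escapes target directory: " + entry.getName());
                }
                if (entry.isDirectory()) {
                    Files.createDirectories(out);
                } else {
                    // Make sure sub-folders such as HeaderTxt/ exist before copying.
                    Files.createDirectories(out.getParent());
                    try (InputStream in = zip.getInputStream(entry)) {
                        Files.copy(in, out, StandardCopyOption.REPLACE_EXISTING);
                    }
                }
            }
        }
        return target;
    }
}
```

A call could look like `ZipExtractor.extractInto(archive, archive.toPath().toAbsolutePath().getParent(), getDateWithinHtmlInsideZipFile(archive));`, which would place the extracted files in a "December 2011" folder next to the zip file.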
