Parse elements from an HTML file (treating it as a text file) in Java

Question

I have an HTML file and I am trying to extract the "table" content between two anchors.

Here is the sample html content:

<HTML>
<HEAD>
<TITLE>
Test Doc
</TITLE>
</HEAD>
<BODY LINK=#000000 VLINK=#000000 ALINK=#990000>

<A NAME = "linkTab0000"></A>
<TABLE CELLPADING=1 BORDER=2 WIDTH=100%>
<TR>
<TD ALIGN=LEFT VALIGN=TOP BGCOLOR=#a6caf0>
<B><FONT SIZE=3 COLOR=#000000 FACE='HP Simplified'>Test Entity</FONT></B></TD>
</TR>
</TABLE><TABLE CELLPADING=2 BORDER=2 WIDTH=100%>
<TR>
<TD ALIGN=LEFT VALIGN=TOP BGCOLOR=#64b1ff WIDTH=200>
<B><FONT SIZE=2 COLOR=#000000 FACE='HP Simplified'>Name</FONT></B></TD>
<TD ALIGN=LEFT VALIGN=TOP BGCOLOR=#64b1ff WIDTH=200>
<B><FONT SIZE=2 COLOR=#000000 FACE='HP Simplified'>Datatype</FONT></B></TD>
<TD ALIGN=LEFT VALIGN=TOP BGCOLOR=#64b1ff WIDTH=200>
<B><FONT SIZE=2 COLOR=#000000 FACE='HP Simplified'>Definition</FONT></B></TD>
<TD ALIGN=LEFT VALIGN=TOP BGCOLOR=#64b1ff WIDTH=200>
<B><FONT SIZE=2 COLOR=#000000 FACE='HP Simplified'>Note</FONT></B></TD>
</TR>
<TR>
<TD ALIGN=LEFT VALIGN=TOP BGCOLOR=#ffffff>
<FONT SIZE=2 COLOR=#000000 FACE='HP Simplified'>test</FONT></TD>
<TD ALIGN=LEFT VALIGN=TOP BGCOLOR=#ffffff>
<FONT SIZE=2 COLOR=#000000 FACE='HP Simplified'>test</FONT></TD>
<TD ALIGN=LEFT VALIGN=TOP BGCOLOR=#ffffff>
<FONT SIZE=2 COLOR=#000000 FACE='HP Simplified'>test</FONT></TD>
<TD ALIGN=LEFT VALIGN=TOP BGCOLOR=#ffffff>
<FONT SIZE=2 COLOR=#000000 FACE='HP Simplified'>&nbsp</FONT></TD>
</TR>
</TABLE>

<A NAME = "linkTab0001"></A>
<TABLE CELLPADING=1 BORDER=2 WIDTH=100%>
<TR>
<TD ALIGN=LEFT VALIGN=TOP BGCOLOR=#a6caf0>
<B><FONT SIZE=3 COLOR=#000000 FACE='HP Simplified'>Test Entity</FONT></B></TD>
</TR>
</TABLE><TABLE CELLPADING=2 BORDER=2 WIDTH=100%>
<TR>
<TD ALIGN=LEFT VALIGN=TOP BGCOLOR=#64b1ff WIDTH=200>
<B><FONT SIZE=2 COLOR=#000000 FACE='HP Simplified'>Name</FONT></B></TD>
<TD ALIGN=LEFT VALIGN=TOP BGCOLOR=#64b1ff WIDTH=200>
<B><FONT SIZE=2 COLOR=#000000 FACE='HP Simplified'>Datatype</FONT></B></TD>
<TD ALIGN=LEFT VALIGN=TOP BGCOLOR=#64b1ff WIDTH=200>
<B><FONT SIZE=2 COLOR=#000000 FACE='HP Simplified'>Definition</FONT></B></TD>
<TD ALIGN=LEFT VALIGN=TOP BGCOLOR=#64b1ff WIDTH=200>
<B><FONT SIZE=2 COLOR=#000000 FACE='HP Simplified'>Note</FONT></B></TD>
</TR>
<TR>
<TD ALIGN=LEFT VALIGN=TOP BGCOLOR=#ffffff>
<FONT SIZE=2 COLOR=#000000 FACE='HP Simplified'>test</FONT></TD>
<TD ALIGN=LEFT VALIGN=TOP BGCOLOR=#ffffff>
<FONT SIZE=2 COLOR=#000000 FACE='HP Simplified'>CHAR(18)</FONT></TD>
<TD ALIGN=LEFT VALIGN=TOP BGCOLOR=#ffffff>
<FONT SIZE=2 COLOR=#000000 FACE='HP Simplified'>test</FONT></TD>
<TD ALIGN=LEFT VALIGN=TOP BGCOLOR=#ffffff>
<FONT SIZE=2 COLOR=#000000 FACE='HP Simplified'>&nbsp</FONT></TD>
</TR>
</TABLE>

I want to extract the "TABLE" element between "<A NAME = "linkTAB....." elements.

Below is the code I am using:

     Document doc = Jsoup.parse(input, "UTF-8", "http://example.com/");

     System.out.println(doc.html());

     String inputStr = doc.html();
     String link = "<a name = "+"linkTab";
     Pattern p = Pattern.compile("(\\b^"+link+"\\b)(.*?)(\\^"+link+"\\b)");
     Matcher m = p.matcher(inputStr);
     List<String> matches = new ArrayList<String>();
     while (m.find()) {
         matches.add(m.group());
     }

I also tried using a BufferedReader, but it ignores the String link = "

Please let me know of any suggestions.

Thanks

To ask a clarifying question: are you trying to extract data while it's running, or are you just trying to read the file? If you are trying to extract data while it's running, you want it to send its data out in XML format; then Java can read it. Here is a link for Java XML: http://stackoverflow.com/questions/428073/what-is-the-best-simplest-way-to-read-in-an-xml-file-in-java-application — fftk4323, Sep 05 '13 at 19:26
@user1040730 sorry, which `table` element would you like to extract? It's not clear from the question to me... — Katona, Sep 05 '13 at 19:43
you mean from this website? Be a little more specific than... Here — fftk4323, Sep 05 '13 at 20:05
Apologies! I meant the html content highlighted in the question. — user1040730, Sep 05 '13 at 21:00

score 1 · Accepted Answer · answered Sep 05 '13 at 20:56

Regex is not a good approach for parsing HTML files like that.
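
To make that concrete: the pattern in the question cannot match as written, because `^` only matches at the start of the input, `\\^` looks for a literal caret, and `.` does not cross line breaks without `Pattern.DOTALL`. On top of that, jsoup's `doc.html()` writes the anchor as `<a name="linkTab0000">`, without the spaces around `=` that the search string expects. If you really do want to treat the file as plain text, a minimal sketch could look like the one below. It assumes the anchors always look like the ones in your sample, and the names `AnchorSections` and `sections` are made up for illustration; it is not a general HTML parser.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AnchorSections {

    // Group 1: the anchor name (e.g. linkTab0000).
    // Group 2: everything up to the next linkTab anchor, or the end of the input.
    private static final Pattern SECTION = Pattern.compile(
            "<a\\s+name\\s*=\\s*\"(linkTab[^\"]*)\"\\s*>\\s*</a>"
                    + "(.*?)"
                    + "(?=<a\\s+name\\s*=\\s*\"linkTab|\\z)",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns the raw text that follows each anchor, keyed by anchor name, in document order.
    static Map<String, String> sections(String html) {
        Map<String, String> result = new LinkedHashMap<>();
        Matcher m = SECTION.matcher(html);
        while (m.find()) {
            result.put(m.group(1), m.group(2).trim());
        }
        return result;
    }

    public static void main(String[] args) {
        // Shortened stand-in for the sample file in the question.
        String html = "<A NAME = \"linkTab0000\"></A>\n"
                + "<TABLE><TR><TD>first</TD></TR></TABLE><TABLE><TR><TD>second</TD></TR></TABLE>\n\n"
                + "<A NAME = \"linkTab0001\"></A>\n"
                + "<TABLE><TR><TD>third</TD></TR></TABLE>\n";
        sections(html).forEach((name, body) -> System.out.println(name + " -> " + body));
    }
}
```

Even then, any change in the markup (extra attributes on the anchor, a different attribute order) breaks the pattern, which is why a parser is the safer route.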

You would be better off using XPath or jsoup's built-in CSS-like selectors.

This is how I solved it for your problem:

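import java.io.IOException;
import java.io.StringWriter;

import org.apache.commons.io.IOUtils;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public class Main {
    public static void main(String[] args) throws IOException {
        // Read your html into a string (IOUtils comes from Apache Commons IO)
        StringWriter writer = new StringWriter();
        IOUtils.copy(Main.class.getResourceAsStream("/so18644171/html.html"), writer);
        String theString = writer.toString();

        Document doc = Jsoup.parse(theString);

        // a[name^=linkTab] means:
        // all a elements that have a name attribute starting with "linkTab"
        // "a[name^=linkTab] + table" means:
        // every table that is the immediate next sibling of such an a element
        Elements linkTabs = doc.select("a[name^=linkTab] + table");

        System.out.println(linkTabs);
    }
}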

This prints:

<table cellpading="1" border="2" width="100%"> 
 <tbody>
  <tr> 
   <td align="LEFT" valign="TOP" bgcolor="#a6caf0"> <b><font size="3" color="#000000" face="HP Simplified">Test Entity</font></b></td> 
  </tr> 
 </tbody>
</table>
<table cellpading="1" border="2" width="100%"> 
 <tbody>
  <tr> 
   <td align="LEFT" valign="TOP" bgcolor="#a6caf0"> <b><font size="3" color="#000000" face="HP Simplified">Test Entity</font></b></td> 
  </tr> 
 </tbody>
</table>

I've uploaded this example to:

https://github.com/d0x/questions/blob/master/stackoverflowPlayground/src/main/java/so18644171/Main.java
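
Note that `+` only matches the single table directly after each anchor, so the second table of every pair in your sample (the one with the Name/Datatype columns) is not in the output above. If you need every table up to the next anchor, one option is to walk the siblings yourself. This is a rough sketch, assuming the tables are direct siblings of the anchors as in your sample; `TablesPerAnchor` and `tablesByAnchor` are names made up for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class TablesPerAnchor {

    // Groups the <table> siblings that directly follow each linkTab anchor,
    // keyed by the anchor's name attribute, in document order.
    static Map<String, List<Element>> tablesByAnchor(String html) {
        Document doc = Jsoup.parse(html);
        Map<String, List<Element>> result = new LinkedHashMap<>();
        for (Element anchor : doc.select("a[name^=linkTab]")) {
            List<Element> tables = new ArrayList<>();
            // Walk forward through sibling elements and stop at the first one
            // that is not a table (for example the next anchor) or at the end.
            Element next = anchor.nextElementSibling();
            while (next != null && "table".equalsIgnoreCase(next.tagName())) {
                tables.add(next);
                next = next.nextElementSibling();
            }
            result.put(anchor.attr("name"), tables);
        }
        return result;
    }

    public static void main(String[] args) {
        // Shortened stand-in for the sample file in the question.
        String html = "<A NAME = \"linkTab0000\"></A>"
                + "<TABLE><TR><TD>first</TD></TR></TABLE><TABLE><TR><TD>second</TD></TR></TABLE>"
                + "<A NAME = \"linkTab0001\"></A>"
                + "<TABLE><TR><TD>third</TD></TR></TABLE>";
        tablesByAnchor(html).forEach((name, tables) ->
                System.out.println(name + ": " + tables.size() + " table(s)"));
    }
}
```

Each list holds jsoup `Element` objects, so you can call `outerHtml()` or run further `select(...)` queries on the tables of one anchor without touching the others.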

This is awesome. Thanks! I am not strong in CSS and couldn't find the right expression for parsing. — user1040730, Sep 05 '13 at 21:11
