
I have an HTML file and I am trying to extract the "table" content between two anchors.

Here is the sample html content:

<HTML>
<HEAD>
<TITLE>
Test Doc
</TITLE>
</HEAD>
<BODY LINK=#000000 VLINK=#000000 ALINK=#990000>

<A NAME = "linkTab0000"></A>
<TABLE CELLPADING=1 BORDER=2 WIDTH=100%>
<TR>
<TD ALIGN=LEFT VALIGN=TOP BGCOLOR=#a6caf0>
<B><FONT SIZE=3 COLOR=#000000 FACE='HP Simplified'>Test Entity</FONT></B></TD>
</TR>
</TABLE><TABLE CELLPADING=2 BORDER=2 WIDTH=100%>
<TR>
<TD ALIGN=LEFT VALIGN=TOP BGCOLOR=#64b1ff WIDTH=200>
<B><FONT SIZE=2 COLOR=#000000 FACE='HP Simplified'>Name</FONT></B></TD>
<TD ALIGN=LEFT VALIGN=TOP BGCOLOR=#64b1ff WIDTH=200>
<B><FONT SIZE=2 COLOR=#000000 FACE='HP Simplified'>Datatype</FONT></B></TD>
<TD ALIGN=LEFT VALIGN=TOP BGCOLOR=#64b1ff WIDTH=200>
<B><FONT SIZE=2 COLOR=#000000 FACE='HP Simplified'>Definition</FONT></B></TD>
<TD ALIGN=LEFT VALIGN=TOP BGCOLOR=#64b1ff WIDTH=200>
<B><FONT SIZE=2 COLOR=#000000 FACE='HP Simplified'>Note</FONT></B></TD>
</TR>
<TR>
<TD ALIGN=LEFT VALIGN=TOP BGCOLOR=#ffffff>
<FONT SIZE=2 COLOR=#000000 FACE='HP Simplified'>test</FONT></TD>
<TD ALIGN=LEFT VALIGN=TOP BGCOLOR=#ffffff>
<FONT SIZE=2 COLOR=#000000 FACE='HP Simplified'>test</FONT></TD>
<TD ALIGN=LEFT VALIGN=TOP BGCOLOR=#ffffff>
<FONT SIZE=2 COLOR=#000000 FACE='HP Simplified'>test</FONT></TD>
<TD ALIGN=LEFT VALIGN=TOP BGCOLOR=#ffffff>
<FONT SIZE=2 COLOR=#000000 FACE='HP Simplified'>&nbsp</FONT></TD>
</TR>
</TABLE>

<A NAME = "linkTab0001"></A>
<TABLE CELLPADING=1 BORDER=2 WIDTH=100%>
<TR>
<TD ALIGN=LEFT VALIGN=TOP BGCOLOR=#a6caf0>
<B><FONT SIZE=3 COLOR=#000000 FACE='HP Simplified'>Test Entity</FONT></B></TD>
</TR>
</TABLE><TABLE CELLPADING=2 BORDER=2 WIDTH=100%>
<TR>
<TD ALIGN=LEFT VALIGN=TOP BGCOLOR=#64b1ff WIDTH=200>
<B><FONT SIZE=2 COLOR=#000000 FACE='HP Simplified'>Name</FONT></B></TD>
<TD ALIGN=LEFT VALIGN=TOP BGCOLOR=#64b1ff WIDTH=200>
<B><FONT SIZE=2 COLOR=#000000 FACE='HP Simplified'>Datatype</FONT></B></TD>
<TD ALIGN=LEFT VALIGN=TOP BGCOLOR=#64b1ff WIDTH=200>
<B><FONT SIZE=2 COLOR=#000000 FACE='HP Simplified'>Definition</FONT></B></TD>
<TD ALIGN=LEFT VALIGN=TOP BGCOLOR=#64b1ff WIDTH=200>
<B><FONT SIZE=2 COLOR=#000000 FACE='HP Simplified'>Note</FONT></B></TD>
</TR>
<TR>
<TD ALIGN=LEFT VALIGN=TOP BGCOLOR=#ffffff>
<FONT SIZE=2 COLOR=#000000 FACE='HP Simplified'>test</FONT></TD>
<TD ALIGN=LEFT VALIGN=TOP BGCOLOR=#ffffff>
<FONT SIZE=2 COLOR=#000000 FACE='HP Simplified'>CHAR(18)</FONT></TD>
<TD ALIGN=LEFT VALIGN=TOP BGCOLOR=#ffffff>
<FONT SIZE=2 COLOR=#000000 FACE='HP Simplified'>test</FONT></TD>
<TD ALIGN=LEFT VALIGN=TOP BGCOLOR=#ffffff>
<FONT SIZE=2 COLOR=#000000 FACE='HP Simplified'>&nbsp</FONT></TD>
</TR>
</TABLE>

I want to extract the "TABLE" elements that sit between the "<A NAME = "linkTab....." anchors.

Below is the code I am using:

     Document doc = Jsoup.parse(input, "UTF-8", "http://example.com/");

     System.out.println(doc.html());

     String inputStr = doc.html();
     String link = "<a name = "+"linkTab";
     Pattern p = Pattern.compile("(\\b^"+link+"\\b)(.*?)(\\^"+link+"\\b)");
     Matcher m = p.matcher(inputStr);
     List<String> matches = new ArrayList<String>();
     while (m.find()) {
         matches.add(m.group());
     }

I also tried using a BufferedReader, but it ignores the String link = "

Please let me know of any suggestions.

Thanks

user1040730
  • To ask a clarifying question: are you trying to extract data while it's running, or are you just trying to read the file? If you are trying to extract data while it's running, you want it to send its data out in XML format; then Java can read it. Here is a link for Java XML: http://stackoverflow.com/questions/428073/what-is-the-best-simplest-way-to-read-in-an-xml-file-in-java-application – fftk4323 Sep 05 '13 at 19:26
  • I am trying to read the file here – user1040730 Sep 05 '13 at 19:33
  • @user1040730 sorry, which `table` element would you like to extract? It's not clear from the question to me... – Katona Sep 05 '13 at 19:43
  • you mean from this website? Be a little more specific than... Here – fftk4323 Sep 05 '13 at 20:05
  • Apologies! I meant the html content highlighted in the question. – user1040730 Sep 05 '13 at 21:00
  • @katona, the table element in the html: – user1040730 Sep 05 '13 at 21:01

1 Answer

Regex is not a good approach for parsing HTML files like this.
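As a rough illustration of why the pattern in the question finds nothing, here is a small sketch that uses only java.util.regex. The SAMPLE string is a hand-written approximation of what doc.html() might emit (lowercase tags, quoted attributes), and REPAIRED is a hypothetical pattern written for this example only. Even the repaired version depends on the exact serialization, which is why it stays brittle.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexPitfall {
    // Assumption: roughly how a parser might re-serialize the anchors.
    static final String SAMPLE =
        "<a name=\"linkTab0000\"></a>\n<table>first</table>\n"
      + "<a name=\"linkTab0001\"></a>\n<table>second</table>\n";

    // Pattern from the question, unchanged.
    static final String LINK = "<a name = " + "linkTab";
    static final Pattern ORIGINAL =
        Pattern.compile("(\\b^" + LINK + "\\b)(.*?)(\\^" + LINK + "\\b)");

    // Hypothetical repair: no spaces around '=', DOTALL so '.' spans newlines,
    // and a lookahead so the next anchor is not consumed by the match.
    static final Pattern REPAIRED = Pattern.compile(
        "<a name=\"linkTab\\d+\"></a>(.*?)(?=<a name=\"linkTab|\\z)",
        Pattern.DOTALL);

    static int countMatches(Pattern p, String s) {
        Matcher m = p.matcher(s);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        // ORIGINAL fails for several reasons at once: it expects the literal
        // text "<a name = linkTab" (spaces, no quote), "\\^" demands a literal
        // caret, "^" only matches at the start of the input, and "." does not
        // cross line breaks without DOTALL.
        System.out.println("original: " + countMatches(ORIGINAL, SAMPLE));
        System.out.println("repaired: " + countMatches(REPAIRED, SAMPLE));
    }
}
```

Any change in attribute order, quoting, or whitespace would break REPAIRED again, which is the main argument for using a selector instead.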

You are better off using XPath or jsoup's built-in CSS-like selectors.

This is how I solved it for your problem:

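import java.io.IOException;
import java.io.StringWriter;
import java.nio.charset.StandardCharsets;

import org.apache.commons.io.IOUtils;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public class Main {
    public static void main(String[] args) throws IOException {
        // Read the HTML resource into a string (UTF-8, as in the question)
        StringWriter writer = new StringWriter();
        IOUtils.copy(Main.class.getResourceAsStream("/so18644171/html.html"), writer, StandardCharsets.UTF_8);
        String theString = writer.toString();

        Document doc = Jsoup.parse(theString);

        // "a[name^=linkTab]" matches every <a> whose name attribute starts with "linkTab".
        // The adjacent-sibling combinator "+ table" then selects the <table> that
        // immediately follows each such anchor.
        Elements linkTabs = doc.select("a[name^=linkTab] + table");

        System.out.println(linkTabs);
    }
}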

This prints:

<table cellpading="1" border="2" width="100%"> 
 <tbody>
  <tr> 
   <td align="LEFT" valign="TOP" bgcolor="#a6caf0"> <b><font size="3" color="#000000" face="HP Simplified">Test Entity</font></b></td> 
  </tr> 
 </tbody>
</table>
<table cellpading="1" border="2" width="100%"> 
 <tbody>
  <tr> 
   <td align="LEFT" valign="TOP" bgcolor="#a6caf0"> <b><font size="3" color="#000000" face="HP Simplified">Test Entity</font></b></td> 
  </tr> 
 </tbody>
</table>
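Note that the output above contains only the first table after each anchor, because "+" selects just the immediately following sibling. If every table between two anchors is wanted, one possible approach is to walk the siblings of each anchor. This is a minimal sketch, assuming the tables are direct siblings of the anchors as in the sample; the class name TablesPerAnchor, the helper tablesByAnchor, and the shortened inline HTML are made up for this example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class TablesPerAnchor {

    /** Maps each anchor name to the tables that directly follow it as siblings. */
    static Map<String, List<Element>> tablesByAnchor(Document doc) {
        Map<String, List<Element>> result = new LinkedHashMap<>();
        for (Element anchor : doc.select("a[name^=linkTab]")) {
            List<Element> tables = new ArrayList<>();
            Element next = anchor.nextElementSibling();
            // Walk forward until something other than a table
            // (for example the next anchor) appears.
            while (next != null && next.tagName().equals("table")) {
                tables.add(next);
                next = next.nextElementSibling();
            }
            result.put(anchor.attr("name"), tables);
        }
        return result;
    }

    public static void main(String[] args) {
        // Shortened stand-in for the HTML in the question (assumption).
        String html = "<a name=\"linkTab0000\"></a>"
            + "<table><tr><td>Test Entity</td></tr></table>"
            + "<table><tr><td>Name</td></tr></table>"
            + "<a name=\"linkTab0001\"></a>"
            + "<table><tr><td>Test Entity</td></tr></table>";
        Document doc = Jsoup.parse(html);
        tablesByAnchor(doc).forEach((name, tables) ->
            System.out.println(name + ": " + tables.size() + " table(s)"));
    }
}
```

A LinkedHashMap is used so the anchors come back in document order. If grouping per anchor is not needed, the general-sibling selector "a[name^=linkTab] ~ table" may be enough.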

I've uploaded this example to:

https://github.com/d0x/questions/blob/master/stackoverflowPlayground/src/main/java/so18644171/Main.java

d0x