I have a document that contains <br/>, <p>, and <table> elements.
I have been trying to parse this HTML using Jsoup while preserving the line breaks.
I tried many methods from similar questions, but none of them worked. My current attempt:
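import java.io.File;
import java.io.FileInputStream;
import java.io.PrintWriter;
import org.apache.commons.io.IOUtils;
import org.jsoup.Jsoup;

// Read the whole HTML file into a string.
FileInputStream in = new FileInputStream("C:............xxx.htm");
String htmlText = IOUtils.toString(in);
in.close();
File file = new File("C:............xxx.txt");
PrintWriter pr = new PrintWriter(file);
// Mark every <br> with a placeholder, strip the markup, then turn the placeholder back into a newline.
String text = Jsoup.parse(htmlText.replaceAll("(?i)<br[^>]*>", "br2n")).text().replaceAll("br2n", "\n");
System.out.println(text);
pr.println(text);
// Earlier attempt: strip the tags one source line at a time.
// for (String line : htmlText.split("\n")) {
//     String stripped = Jsoup.parse(line).text();
//     System.out.println(stripped);
//     pr.println(stripped);
// }
pr.close();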
Here is a representative part of my HTML file (the original file starts with <html> ..., of course):
<table border="0" cellspacing="0" cellpadding="0" bgcolor="white"
width='650'>
<tr>
<td><font size="4"><br />
<b>The scientific explantion of the syndrom</b></font>
<table width='650' border="0" cellspacing="5" cellpadding="0">
<tr>
<td width='5%'> </td>
<td width='25%'> </td>
<td width='25%'> </td>
<td width='15%'> </td>
<td width='30%'> </td>
</tr>
<tr height="24">
<td align="left" nowrap="nowrap" colspan="3"><font size=
"3"><b>Recent Update</b></font></td>
<td align="left" nowrap="nowrap"><a name=
"9J003346248"></a><font size="3"><b>Issue:</b></font></td>
<td align="left"><font size="3">9569865248</font></td>
</tr>
<tr>
<td> </td>
<td align="left"><b>Locust:</b></td>
<td align="left" colspan="3">UYF78UIGK</td>
</tr>
</table>
<br/> The explanation above does not necc....... <p>
Blah ....
</p>
<table border="2" cellspacing="1" cellpadding="0" bgcolor="white"
width='750'>
<tr>
<td><font size="4"><br />
<b>Syndrom of the main ......</b></font>
<table width='650' border="0" cellspacing="5" cellpadding="0">
<tr>
<td width='5%'> </td>
<td width='25%'> </td>
<td width='25%'> </td>
<td width='15%'> </td>
<td width='30%'> </td>
</tr>
<tr height="24">
<td align="left" nowrap="nowrap" colspan="3"><font size=
"3"><b>Data</b></font></td>
<td align="left" nowrap="nowrap"><a name=
"9J003346248"></a><font size="3"><b>Issue:</b></font></td>
<td align="left"><font size="3">9509809248</font></td>
</tr>
<tr>
<td> </td>
<td align="left"><b>Locust:</b></td>
<td align="left" colspan="3">U344365GK</td>
</tr>
</table>
<br/> The explanation above does not necc....... <p>
Blah ....
</p>
I need to make sure that all rows in those tables appear one after another, the way they do in the original document. But I have multiple tables and other line-breaking elements. How can I do this using Jsoup? Or is there another API that can parse HTML and keep the line breaks more effectively?
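
To make the requirement concrete, here is a minimal sketch of one possible direction: walk the parsed tree with Jsoup's NodeVisitor and start a new output line at every line-breaking element. The class name HtmlLines, the method toText, and the choice of tags in LINE_BREAKING are my own assumptions for illustration, not something Jsoup prescribes, and I have not verified it against the full file.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;
import org.jsoup.select.NodeVisitor;

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper: converts HTML to text, one line per row / paragraph / <br>.
final class HtmlLines {

    // Assumption: these are the tags that should end the current output line.
    private static final Set<String> LINE_BREAKING =
            new HashSet<>(Arrays.asList("br", "p", "tr", "table", "div"));

    static String toText(String html) {
        final StringBuilder out = new StringBuilder();
        Jsoup.parse(html).body().traverse(new NodeVisitor() {
            @Override
            public void head(Node node, int depth) {
                if (node instanceof TextNode) {
                    // text() collapses runs of whitespace inside the node
                    String t = ((TextNode) node).text().trim();
                    if (!t.isEmpty()) {
                        // separate cells/fragments on the same line with one space
                        if (out.length() > 0 && out.charAt(out.length() - 1) != '\n') {
                            out.append(' ');
                        }
                        out.append(t);
                    }
                } else if (LINE_BREAKING.contains(node.nodeName())) {
                    newline(out);
                }
            }

            @Override
            public void tail(Node node, int depth) {
                if (LINE_BREAKING.contains(node.nodeName())) {
                    newline(out);
                }
            }
        });
        return out.toString().trim();
    }

    // Appends a newline unless the output is empty or already ends with one,
    // so nested line-breaking elements do not produce blank lines.
    private static void newline(StringBuilder out) {
        if (out.length() > 0 && out.charAt(out.length() - 1) != '\n') {
            out.append('\n');
        }
    }
}
```

With this sketch, HtmlLines.toText(htmlText) would take the place of the Jsoup.parse(...).text() call in my attempt, and the cells of each <tr> should end up on one line separated by single spaces. Adding "td" to LINE_BREAKING would put every cell on its own line instead.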