I've found a way to do it using Jsoup and the Java 8 Stream API:
//given:
final InputStream html = getClass().getClassLoader().getResourceAsStream("table.html");

//when:
final Document document = Jsoup.parse(html, "UTF-8", "/");

final List<List<String>> result = document.select("table tr")
        .stream()
        // Select all <td> tags in single row
        .map(tr -> tr.select("td"))
        // Repeat n-times those <td> that have `colspan="n"` attribute
        .map(rows -> rows.stream()
                .map(td -> Collections.nCopies(td.hasAttr("colspan") ? Integer.valueOf(td.attr("colspan")) : 1, td))
                .flatMap(Collection::stream)
                .collect(Collectors.toList())
        )
        // Fold final structure to 2D List<List<Element>>
        .reduce(new ArrayList<List<Element>>(), (acc, row) -> {
            // First iteration - just add current row to a final structure
            if (acc.isEmpty()) {
                acc.add(row);
                return acc;
            }

            // If last array in 2D array does not contain element with `rowspan` - append current
            // row and skip to next iteration step
            final List<Element> last = acc.get(acc.size() - 1);
            if (last.stream().noneMatch(td -> td.hasAttr("rowspan"))) {
                acc.add(row);
                return acc;
            }

            // In this case last array in 2D array contains an element with `rowspan` - we are going to
            // add this element n-times to current rows where n == rowspan - 1
            final AtomicInteger index = new AtomicInteger(0);
            last.stream()
                    // Map to a helper list of (index in array, rowspan value or 0 if not present, Jsoup element)
                    .map(td -> Arrays.asList(index.getAndIncrement(), Integer.valueOf(td.hasAttr("rowspan") ? td.attr("rowspan") : "0"), td))
                    // Filter out all elements without rowspan
                    .filter(it -> ((int) it.get(1)) > 1)
                    // Add all elements with rowspan to current row at the index they are present
                    // (add them with `rowspan="n-1"`)
                    .forEach(it -> {
                        final int idx = (int) it.get(0);
                        final int rowspan = (int) it.get(1);
                        final Element td = (Element) it.get(2);

                        row.add(idx, rowspan - 1 == 0 ? (Element) td.removeAttr("rowspan") : td.attr("rowspan", String.valueOf(rowspan - 1)));
                    });

            acc.add(row);
            return acc;
        }, (a, b) -> a)
        .stream()
        // Extract inner HTML text from Jsoup elements in 2D array
        .map(tr -> tr.stream()
                .map(Element::text)
                .collect(Collectors.toList())
        )
        .collect(Collectors.toList());
I've added a lot of comments that explain what happens at each step of the algorithm.
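To look at the colspan step in isolation, here is a minimal sketch that does not depend on Jsoup. It assumes each cell is a plain String[] pair of {text, colspan}, with an empty string meaning that no colspan is set. The names ColspanSketch and expandRow are made up for illustration and are not part of the code above.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.stream.Collectors;

final class ColspanSketch {

    // Hypothetical cell representation: {text, colspan}; "" means colspan is not set.
    static List<String> expandRow(List<String[]> cells) {
        return cells.stream()
                // Repeat the text n times when colspan is n, once otherwise
                .map(cell -> Collections.nCopies(cell[1].isEmpty() ? 1 : Integer.parseInt(cell[1]), cell[0]))
                .flatMap(List::stream)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String[]> header = Arrays.asList(
                new String[] {"H1", ""},
                new String[] {"H2", "2"});
        System.out.println(expandRow(header)); // [H1, H2, H2]
    }
}
```

This is the same Collections.nCopies plus flatMap idea as in the stream above, only applied to strings instead of Jsoup elements.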
In this example I used the following HTML file:
<body>
  <table border="1">
    <tr><td>H1</td><td colspan="2">H2</td></tr>
    <tr><td></td><td>SubH2_1</td><td>SubH2_2</td></tr>
    <tr><td rowspan="2">A1</td><td>B1</td><td>C1</td></tr>
    <tr><td>B2</td><td>C3</td></tr>
    <tr><td>C4</td><td>C5</td><td>C6</td></tr>
    <tr><td>D7</td><td colspan="2">D9</td></tr>
    <tr><td colspan="3">Notes</td></tr>
  </table>
</body>
It's the same as yours; the only difference is that the rowspan usage is fixed: in your example A1 is repeated three times instead of two. Also, the two <tr> tags in this example are closed correctly; otherwise two additional empty arrays show up in the final structure.
Here is the console output:
[H1, H2, H2]
[, SubH2_1, SubH2_2]
[A1, B1, C1]
[A1, B2, C3]
[C4, C5, C6]
[D7, D9, D9]
[Notes, Notes, Notes]
You can also run this example with the exact HTML you pasted in your question; it produces slightly different output:
[H1, H2, H2]
[]
[, SubH2_1, SubH2_2]
[]
[A1, B1, C1]
[A1, B2, C1]
[A1, B2, C3]
[C4, C5, C6]
[D7, D9, D9]
[Notes, Notes, Notes]
Those empty arrays show up because there are two unclosed <tr> elements in your HTML:
<tr><td>H1</td><td colspan="2">H2</td><tr>
<tr><td></td><td>SubH2_1</td><td>SubH2_2</td><tr>
Closing them and running the algorithm again produces the following output:
[H1, H2, H2]
[, SubH2_1, SubH2_2]
[A1, B1, C1]
[A1, B2, C1]
[A1, B2, C3]
[C4, C5, C6]
[D7, D9, D9]
[Notes, Notes, Notes]
As you can see, A1 appears 3 times because it has the attribute rowspan="3", and B2 and C1 each have rowspan="2" as well. This HTML renders "almost" the same as the one in my first example, but when you take a closer look at those 3 rows you will see that they are not at the same pixel level. To match your expected result, I fixed the input HTML so that it looks and behaves as you expect.
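To make the rowspan carry-over easier to follow, here is a minimal sketch on plain data instead of Jsoup elements. It assumes a hypothetical Cell class holding the text and the number of rows the cell still covers; RowspanSketch, Cell and fill are illustrative names only. Unlike the stream version above, it copies cells instead of mutating the rowspan attribute, and it clamps the insert index so a short row cannot cause an IndexOutOfBoundsException.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

final class RowspanSketch {

    // Hypothetical stand-in for a <td>: its text plus how many rows it still covers.
    static final class Cell {
        final String text;
        final int rowspan;

        Cell(String text, int rowspan) {
            this.text = text;
            this.rowspan = rowspan;
        }
    }

    static List<List<String>> fill(List<List<Cell>> rows) {
        List<List<Cell>> grid = new ArrayList<>();
        for (List<Cell> source : rows) {
            List<Cell> row = new ArrayList<>(source);
            if (!grid.isEmpty()) {
                List<Cell> previous = grid.get(grid.size() - 1);
                for (int i = 0; i < previous.size(); i++) {
                    Cell above = previous.get(i);
                    // A cell that still spans more than one row continues into this row
                    if (above.rowspan > 1) {
                        row.add(Math.min(i, row.size()), new Cell(above.text, above.rowspan - 1));
                    }
                }
            }
            grid.add(row);
        }
        // Keep only the text of each cell
        return grid.stream()
                .map(r -> r.stream().map(c -> c.text).collect(Collectors.toList()))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // The three rows discussed above: A1 rowspan=3, C1 rowspan=2, B2 rowspan=2
        List<List<Cell>> rows = Arrays.asList(
                Arrays.asList(new Cell("A1", 3), new Cell("B1", 1), new Cell("C1", 2)),
                Arrays.asList(new Cell("B2", 2)),
                Arrays.asList(new Cell("C3", 1)));
        fill(rows).forEach(System.out::println);
    }
}
```

With these three rows the sketch yields [A1, B1, C1], [A1, B2, C1] and [A1, B2, C3], which matches the middle of the output above.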
What if I cannot modify input HTML?
Well, if you cannot modify the input HTML then you will have to:
- filter out all empty arrays created by the unclosed <tr> tags
- review your output expectations for A1, B2 and C3: the HTML view does not show the exact structure of this table as written in HTML.
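The first point can be handled after parsing. A minimal sketch, assuming the result is the List<List<String>> produced by the algorithm (EmptyRowFilterSketch and dropEmptyRows are hypothetical names used only here):

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.stream.Collectors;

final class EmptyRowFilterSketch {

    // Drops rows that contain no cells at all; rows with empty-text cells are kept.
    static List<List<String>> dropEmptyRows(List<List<String>> rows) {
        return rows.stream()
                .filter(row -> !row.isEmpty())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<List<String>> parsed = Arrays.asList(
                Arrays.asList("H1", "H2", "H2"),
                Collections.<String>emptyList(),
                Arrays.asList("", "SubH2_1", "SubH2_2"),
                Collections.<String>emptyList());
        dropEmptyRows(parsed).forEach(System.out::println);
    }
}
```

Note that the row starting with an empty cell is kept, because the list itself is not empty; only the rows produced by the stray <tr> tags are removed.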
Source code of sample project
Here you can find the full source code of the JUnit test I used to find the answer to your question. Feel free to download this sample Maven project hosted on GitHub to play around with the implementation of the algorithm.
I hope it helps.