How to convert HTML tables with colspan and rowspan into a 2D array (matrix) in Java?

I have found nice solutions in Python and jQuery, but not in Java (only for very simple tables via jsoup). There is one neat solution using XSLT, but because my input HTML files are malformed it does not work for me.

Input table example:

  <body>
    <table border="1">
        <tr><td>H1</td><td colspan="2">H2</td><tr>
        <tr><td></td><td>SubH2_1</td><td>SubH2_2</td><tr>
       <tr><td rowspan="3">A1</td><td>B1</td><td rowspan="2">C1</td></tr>
       <tr><td rowspan="2">B2</td></tr>
       <tr><td>C3</td></tr>
       <tr><td>C4</td><td>C5</td><td>C6</td></tr>
        <tr><td>D7</td><td colspan="2">D9</td></tr>
        <tr><td  colspan="3">Notes</td></tr>
   </table>
</body>


Desired output:

    [['H1', 'H2', 'H2'],
     ['', 'SubH2_1', 'SubH2_2'],
     ['A1', 'B1', 'C1'],
     ['A1', 'B2', 'C3'],
     ['C4', 'C5', 'C6'],
     ['D7', 'D9', 'D9'],
     ['Notes', 'Notes', 'Notes']]
GML-VS

1 Answer

I've found a way to do it using Jsoup and the Java 8 Stream API:

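import java.io.InputStream;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Collectors;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// given (the statements below run inside a test method):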
final InputStream html = getClass().getClassLoader().getResourceAsStream("table.html");

//when:
final Document document = Jsoup.parse(html, "UTF-8", "/");

final List<List<String>> result = document.select("table tr")
    .stream()
    // Select all <td> tags in single row
    .map(tr -> tr.select("td"))
    // Repeat n-times those <td> that have `colspan="n"` attribute
    .map(rows -> rows.stream()
        .map(td -> Collections.nCopies(td.hasAttr("colspan") ? Integer.valueOf(td.attr("colspan")) : 1, td))
        .flatMap(Collection::stream)
        .collect(Collectors.toList())
    )
    // Fold final structure to 2D List<List<Element>>
    .reduce(new ArrayList<List<Element>>(), (acc, row) -> {
        // First iteration - just add current row to a final structure
        if (acc.isEmpty()) {
            acc.add(row);
            return acc;
        }

        // If last array in 2D array does not contain element with `rowspan` - append current
        // row and skip to next iteration step
        final List<Element> last = acc.get(acc.size() - 1);
        if (last.stream().noneMatch(td -> td.hasAttr("rowspan"))) {
            acc.add(row);
            return acc;
        }

        // In this case last array in 2D array contains an element with `rowspan` - we are going to
        // add this element n-times to current rows where n == rowspan - 1
        final AtomicInteger index = new AtomicInteger(0);
        last.stream()
            // Map to a helper list of (index in array, rowspan value or 0 if not present, Jsoup element)
            .map(td -> Arrays.asList(index.getAndIncrement(), Integer.valueOf(td.hasAttr("rowspan") ? td.attr("rowspan") : "0"), td))
            // Filter out all elements without rowspan
            .filter(it -> ((int) it.get(1)) > 1)
            // Add all elements with rowspan to current row at the index they are present 
            // (add them with `rowspan="n-1"`)
            .forEach(it -> {
                final int idx = (int) it.get(0);
                final int rowspan = (int) it.get(1);
                final Element td = (Element) it.get(2);

                row.add(idx, rowspan - 1 == 0 ? (Element) td.removeAttr("rowspan") : td.attr("rowspan", String.valueOf(rowspan - 1)));
            });

        acc.add(row);
        return acc;
    }, (a, b) -> a)
    .stream()
    // Extract inner HTML text from Jsoup elements in 2D array
    .map(tr -> tr.stream()
        .map(Element::text)
        .collect(Collectors.toList())
    )
    .collect(Collectors.toList());

I've added a lot of comments that explain what happens at each step of the algorithm.
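One detail worth noting: the snippet reads `colspan` and `rowspan` with `Integer.valueOf`, which throws a `NumberFormatException` on a non-numeric value. Since the question mentions malformed input, a defensive parser may be useful. The following is only a sketch; `SpanParser` and `parseSpan` are hypothetical names, not part of Jsoup, and treating every invalid or non-positive value as 1 is a simplifying assumption.

```java
class SpanParser {
    // Hypothetical helper (not a Jsoup API): converts a raw colspan/rowspan
    // attribute value to an int, falling back to 1 when the value is
    // null, blank, non-numeric, or smaller than 1.
    static int parseSpan(String raw) {
        if (raw == null) {
            return 1;
        }
        try {
            return Math.max(1, Integer.parseInt(raw.trim()));
        } catch (NumberFormatException e) {
            return 1;
        }
    }

    public static void main(String[] args) {
        System.out.println(parseSpan("2"));
        System.out.println(parseSpan(""));
        System.out.println(parseSpan("abc"));
    }
}
```

Because Jsoup's `attr` returns an empty string for a missing attribute, a call such as `SpanParser.parseSpan(td.attr("colspan"))` could stand in for the `hasAttr` check combined with `Integer.valueOf`.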

In this example I've used the following HTML file:

<body>
<table border="1">
    <tr><td>H1</td><td colspan="2">H2</td></tr>
    <tr><td></td><td>SubH2_1</td><td>SubH2_2</td></tr>
    <tr><td rowspan="2">A1</td><td>B1</td><td>C1</td></tr>
    <tr><td>B2</td><td>C3</td></tr>
    <tr><td>C4</td><td>C5</td><td>C6</td></tr>
    <tr><td>D7</td><td colspan="2">D9</td></tr>
    <tr><td  colspan="3">Notes</td></tr>
</table>
</body>

It's the same as yours except that the rowspan usage is fixed: in your example A1 is repeated three times instead of two. The two <tr> tags that were left unclosed are also closed correctly here; otherwise two additional empty arrays show up in the final structure.

Here is the console output:

[H1, H2, H2]
[, SubH2_1, SubH2_2]
[A1, B1, C1]
[A1, B2, C3]
[C4, C5, C6]
[D7, D9, D9]
[Notes, Notes, Notes]

You can also run this example with the exact HTML you pasted in your question; it produces slightly different output:

[H1, H2, H2]
[]
[, SubH2_1, SubH2_2]
[]
[A1, B1, C1]
[A1, B2, C1]
[A1, B2, C3]
[C4, C5, C6]
[D7, D9, D9]
[Notes, Notes, Notes]

Those empty arrays show up because two rows in your HTML end with <tr> instead of </tr>, which leaves two unclosed <tr> elements:

<tr><td>H1</td><td colspan="2">H2</td><tr>
<tr><td></td><td>SubH2_1</td><td>SubH2_2</td><tr>

Closing them and running the algorithm again produces the following output:

[H1, H2, H2]
[, SubH2_1, SubH2_2]
[A1, B1, C1]
[A1, B2, C1]
[A1, B2, C3]
[C4, C5, C6]
[D7, D9, D9]
[Notes, Notes, Notes]

As you can see, A1 appears three times because it has rowspan="3", and B2 and C1 each appear twice because they have rowspan="2". The browser renders this HTML almost the same as my first example, but if you look closely at those three rows you will see that they are not at the same pixel level. To match your expected output, I fixed the input HTML so that it looks and behaves as you expect.
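The way rowspan and colspan expand into rows can also be illustrated without Jsoup. The sketch below uses a different technique from the stream-based reduce in this answer: it fills an occupancy grid, skipping slots already claimed by a rowspan from an earlier row. `SpanGrid`, `Cell`, `cell` and `expand` are hypothetical names introduced for illustration, and the cells are assumed to have been extracted from the HTML already. One simplification: a rowspan that runs past the last row creates extra rows here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class SpanGrid {
    // Hypothetical cell model: text plus rowspan and colspan (both >= 1).
    static final class Cell {
        final String text;
        final int rowspan;
        final int colspan;

        Cell(String text, int rowspan, int colspan) {
            this.text = text;
            this.rowspan = rowspan;
            this.colspan = colspan;
        }
    }

    static Cell cell(String text) {
        return new Cell(text, 1, 1);
    }

    static Cell cell(String text, int rowspan, int colspan) {
        return new Cell(text, rowspan, colspan);
    }

    // Expands spans so that every grid slot a cell covers holds its text.
    static List<List<String>> expand(List<List<Cell>> rows) {
        List<List<String>> grid = new ArrayList<>();
        for (int r = 0; r < rows.size(); r++) {
            rowAt(grid, r); // make sure the row exists even if it has no cells
            int c = 0;
            for (Cell cell : rows.get(r)) {
                // Skip slots already filled by a rowspan from an earlier row.
                while (get(grid, r, c) != null) {
                    c++;
                }
                for (int dr = 0; dr < cell.rowspan; dr++) {
                    for (int dc = 0; dc < cell.colspan; dc++) {
                        set(grid, r + dr, c + dc, cell.text);
                    }
                }
                c += cell.colspan;
            }
        }
        return grid;
    }

    private static List<String> rowAt(List<List<String>> grid, int r) {
        while (grid.size() <= r) {
            grid.add(new ArrayList<>());
        }
        return grid.get(r);
    }

    private static String get(List<List<String>> grid, int r, int c) {
        List<String> row = rowAt(grid, r);
        return c < row.size() ? row.get(c) : null;
    }

    private static void set(List<List<String>> grid, int r, int c, String value) {
        List<String> row = rowAt(grid, r);
        while (row.size() <= c) {
            row.add(null);
        }
        row.set(c, value);
    }

    public static void main(String[] args) {
        // The three rowspan rows from the question: A1 spans 3, C1 and B2 span 2.
        List<List<Cell>> rows = Arrays.asList(
            Arrays.asList(cell("A1", 3, 1), cell("B1"), cell("C1", 2, 1)),
            Arrays.asList(cell("B2", 2, 1)),
            Arrays.asList(cell("C3")));
        expand(rows).forEach(System.out::println);
    }
}
```

For the three rowspan rows from the question, this sketch yields [A1, B1, C1], [A1, B2, C1] and [A1, B2, C3], the same rows as in the output above.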

What if I cannot modify input HTML?

Well, if you cannot modify the input HTML, then you will have to:

  • filter out all empty arrays created by the unclosed <tr> tags
  • review your output expectations for A1, B2 and C3: the rendered HTML view does not show the exact structure of this table as written in the HTML.
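The first point can be handled with one extra filter step. This is a minimal sketch that assumes the result has already been built as a `List<List<String>>`; `EmptyRowFilter` and `dropEmptyRows` are hypothetical names.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.stream.Collectors;

class EmptyRowFilter {
    // Returns a new list without the rows that contain no cells at all.
    // A row holding a single empty string (an empty <td>) is kept.
    static List<List<String>> dropEmptyRows(List<List<String>> rows) {
        return rows.stream()
            .filter(row -> !row.isEmpty())
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<List<String>> rows = Arrays.asList(
            Arrays.asList("H1", "H2", "H2"),
            Collections.<String>emptyList(),
            Arrays.asList("", "SubH2_1", "SubH2_2"),
            Collections.<String>emptyList());
        dropEmptyRows(rows).forEach(System.out::println);
    }
}
```

In the stream pipeline from the answer, the same effect could be had by adding `.filter(row -> !row.isEmpty())` just before the final `collect`.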

Source code of sample project

Here you can find the full source code of the JUnit test I used to find the answer to your question. Feel free to download this sample Maven project hosted on GitHub to play around with the implementation of the algorithm.

I hope it helps.

Szymon Stepniak