I need to parse an HTML table containing colspans and rowspans and build a representation of it.

Reading the HTML is not a problem: I'm using HTMLCleaner and XQuery with Saxon (Java).

But I'm looking for a good algorithm to build the table, as I don't understand the rules that are followed by the browsers for "difficult" cases.

For example, given the following table (where the rowspan is wrong)

<table border=1>
    <tr><td rowspan="3">1</td><td>2</td></tr>
    <tr><td>3</td></tr>
</table>

I apply the following algorithm:

1) for each tr 
    1.1) expand the colspan and rowspan of the cells in the current line
    1.2) create a new line if it doesn't already exist
    1.3) for each td add the elements to the line

i.e. (E is an empty cell)

newline->no line existing==no expansion
add line elements (1.3)
line1: 1 [tr=3], 2

newline->tr expansion (1.1)
line1: 1[tr=3], 2
line2: E
line3: E

add line elements (1.3)
line1: 1[tr=3], 2
line2: E, 3
line3: E

line3 has to be removed (Firefox renders only two lines). How can I know that?

I'm particularly interested in cases where the elements of an incomplete line are completed with those of the following one, like:

<tr><td>1</td><td>2</td><td>3</td></tr>
<tr><td>4</td><td>5</td></tr>
<tr><td>6</td></tr>

rendering: 1 2 3 
           4 5 6

I have a practical case: this file contains two TRs that are rendered as one row even though they are two different TR elements. Why?

The lines are these (starting from line 129792):

[image]

They are rendered as follows (inside the red rectangle):

[image]

How can I decide to enqueue elements to a previous line?

What rules do browsers follow for weird code?

I'm using Java, and I also understand JavaScript and a little PHP, but I'm mainly interested in the algorithm to follow. I'd like to know if something already exists, or to hear any suggestions.

What I want is to be able to output a text representation of the table like one rendered by a real browser.

Edit:

After reading xtratic's answer, I read the HTML table processing model specification, but it doesn't seem to answer my question about when elements must be appended to the previous line, as in the practical case I described (and added in this edit). Indeed, the document says "16 If current cell is the last td or th element child in the tr element being processed, then increase ycurrent by 1, abort this set of steps, and return to the algorithm above." But browsers do not always seem to move to a new line when the last td is reached.

What interests me most is when to combine different rows. I tried appending TDs after those of the previous line when the previous line has fewer TDs than the maximum found so far, but it doesn't work.

cdarwin

1 Answer

Read the HTML table processing model specification to find out all you need to know about how to process HTML tables. (It's not easy.)

Since you want to parse the structure of an HTML table, I recommend writing your processor by following the steps exactly as listed under §4.9.12.1 Forming a table (step 18 gets into processing rows). I'm quite sure this is how browsers do it as well. The steps are written to be as convenient as possible to translate into code, so you should be able to follow them pretty literally. Once your processor is done you should have a table of cells (as the spec defines it), and you can then do whatever you want with that table model. I can't promise it will be easy, but at least you'll have a step-by-step guide.
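As a rough illustration of the slot-filling part of those steps, here is a minimal Java sketch. It is not the full spec algorithm: it assumes the `colspan` and `rowspan` values are already parsed and at least 1, and it ignores `rowspan="0"` (downward-growing cells), row groups, and column groups. The names `TableGridSketch`, `Cell`, and `build` are hypothetical and chosen only for this example. Note that a rowspan reaching past the last `tr` still extends the grid here, so deciding whether to show or drop such rows is left to a later step.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: fills a slot grid from rows of cells with spans. */
final class TableGridSketch {

    /** Hypothetical input type: one td/th with spans already parsed (both >= 1). */
    static final class Cell {
        final String label;
        final int colspan;
        final int rowspan;

        Cell(String label, int colspan, int rowspan) {
            this.label = label;
            this.colspan = colspan;
            this.rowspan = rowspan;
        }
    }

    /** Returns grid[y][x]: the cell covering slot (x, y), or null if the slot is empty. */
    static Cell[][] build(List<List<Cell>> rows) {
        Map<Long, Cell> slots = new HashMap<>();
        int width = 0;
        int height = 0;
        int y = 0;
        for (List<Cell> tr : rows) {
            if (height == y) {
                height++; // every tr gets a row of its own, even an empty one
            }
            int x = 0;
            for (Cell cell : tr) {
                // Skip slots already covered by a rowspan from an earlier row.
                while (slots.containsKey(key(x, y))) {
                    x++;
                }
                for (int dy = 0; dy < cell.rowspan; dy++) {
                    for (int dx = 0; dx < cell.colspan; dx++) {
                        // Simplification: if two cells overlap, the first occupant is kept.
                        slots.putIfAbsent(key(x + dx, y + dy), cell);
                    }
                }
                width = Math.max(width, x + cell.colspan);
                height = Math.max(height, y + cell.rowspan);
                x += cell.colspan;
            }
            y++; // the next tr always starts on the next row
        }
        Cell[][] grid = new Cell[height][width];
        for (Map.Entry<Long, Cell> e : slots.entrySet()) {
            int row = (int) (e.getKey() >> 32);
            int col = (int) (e.getKey() & 0xFFFFFFFFL);
            grid[row][col] = e.getValue();
        }
        return grid;
    }

    /** Packs non-negative slot coordinates into one map key. */
    private static long key(int x, int y) {
        return ((long) y << 32) | x;
    }
}
```

In this sketch a `tr` never merges with the previous one: each `tr` advances `y` by one, and cells only appear on a shared row because an earlier rowspan already occupies slots there.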


To be extra clear: there is no "combining rows" but there are cells that span multiple rows.

The algorithm for growing downward is what puts GENERALI SPA.. at the start of all those rows, and the data from the following <tr> elements is added into the next available cells on their respective rows.

GENERALI SPA... spans 4 rows, but its first row is hidden since there's no other data on it, so it looks like it only covers 3.

<tr> <!-- row 1 (0px high) -->
    <!-- td spans from [1,1] to [1,4] -->
    <!-- this fills the first column of rows 1, 2, 3, and 4 -->
    <td rowspan="4">GENERALI SPA #1</td>
</tr>
<tr> <!-- row 2 -->
    <!-- col 1 is taken by the cell defined above -->
    <!-- td spans from [2,2] to [2,3] taking up col 2 of row 2 and 3 -->
    <td rowspan="2">GENERALI SPA #2</td>
    <td>Proprieta'</td> <!-- ... -->
</tr>
<tr> <!-- row 3 -->
    <!-- col 1 and 2 are taken by the cells defined above -->
    <td rowspan="1">Totale #1</td> <!-- ... -->
</tr>
<tr> <!-- row 4 -->
    <!-- col 1 is taken by the cell defined above -->
    <td colspan="2">Totale #2</td> <!-- ... -->
</tr>

The table without formatting or hiding would look like this:

   1                      2                     3             4
  +----------------------+---------------------+-------------+---
1 |         ...          |      (implied)         (implied)       <-- 0px high (hidden)
  +-                    -+---------------------+-------------+---
2 | "GENERALI SPA #1"    | "GENERALI SPA #2"   | "Proprieta" | ..
  +-                    -+-                   -+-------------+---
3 |         ...          |         ...         | "Totale #1" | ..
  +-                    -+---------------------+-------------+---
4 |         ...          | "Totale #2"               ...     | ..
  +----------------------+---------------------+-------------+---

This would essentially be the table model you get after parsing by following the process in the html spec.

I don't see much point in removing "incomplete" rows (define incomplete). Let them stay in the table: they are essentially header rows that come before the data they encompass, they aren't really hurting anything, and you can detect them easily enough.

However, if you really want to, you could remove rows whose only cells are cells that also span into other rows. In the table section above, you could remove row 1 because the cell in column 1 spans rows 1, 2, 3, and 4, and row 1 has no other explicitly created cells. All the data of row 1 therefore still exists in the remaining slots that cell spans ([1,2], [1,3], [1,4]), and you can safely remove row 1.
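That removal rule could be sketched as follows, assuming a grid in which `grid[y][x]` holds the cell object covering the slot (`null` for an empty slot) and a spanning cell is the same object reference in every slot it covers. The class and method names are hypothetical, and row indices are 0-based, so row 1 of the example is index 0. A row is only dropped while each of its cells still covers some row that is kept, so a cell's data is never lost entirely.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: finds rows made up only of cells that also cover other rows. */
final class SpanOnlyRows {

    /**
     * grid[y][x] is the cell object covering that slot, or null for an empty slot.
     * Assumption: a spanning cell is the same reference in every slot it covers.
     * Returns the 0-based indices of rows that can be dropped without losing a cell.
     */
    static List<Integer> removableRows(Object[][] grid) {
        boolean[] removed = new boolean[grid.length];
        List<Integer> result = new ArrayList<>();
        for (int y = 0; y < grid.length; y++) {
            boolean removable = true;
            for (int x = 0; x < grid[y].length && removable; x++) {
                Object cell = grid[y][x];
                if (cell != null && !coversAnotherKeptRow(grid, removed, cell, x, y)) {
                    removable = false; // this cell lives only in row y
                }
            }
            if (removable) {
                removed[y] = true;
                result.add(y);
            }
        }
        return result;
    }

    /** True if the same cell also covers column x of some other row that is still kept. */
    private static boolean coversAnotherKeptRow(
            Object[][] grid, boolean[] removed, Object cell, int x, int y) {
        for (int other = 0; other < grid.length; other++) {
            if (other != y && !removed[other]
                    && x < grid[other].length && grid[other][x] == cell) {
                return true;
            }
        }
        return false;
    }
}
```

Under this rule a completely empty row is also reported as removable, which may or may not be what you want.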

As an extra example, when I change rowspan to 1, this data appears on its own row and the following <tr> data fills the available cells on their respective rows:

[image]


vvv less relevant info vvv

The older HTML 4.01 Specification has a straightforward example relating to your question:

The next example illustrates (with the help of table borders) how cell definitions that span more than one row or column affect the definition of later cells. Consider the following table definition:

<TABLE border="1">
<TR><TD>1 <TD rowspan="2">2 <TD>3
<TR><TD>4 <TD>6
<TR><TD>7 <TD>8 <TD>9
</TABLE>

As cell "2" spans the first and second rows, the definition of the second row will take it into account. Thus, the second TD in row two actually defines the row's third cell. Visually, the table might be rendered to a tty device as:

-------------
| 1 | 2 | 3 | 
----|   |----
| 4 |   | 6 |
----|---|----
| 7 | 8 | 9 |
-------------

Note that if the TD defining cell "6" had been omitted, an extra empty cell would have been added by the user agent to complete the row.

This related question and answer lists some libraries that can help you in scraping the tables, but I don't believe this answer would handle the "difficult" cases since it's assuming that the occurrence of the <td> element corresponds exactly to its cell index in the table.

xtratic
  • Thank you for the specification, but you can read what I think of it in the edit I added to my question. I also added a practical case which is driving me mad. – cdarwin Apr 16 '18 at 17:48
  • Initially, based on the spec, I was thinking that a `<tr>` should *always* be an independent row and any omitted `<td>` would just be implied. But seeing that Firefox and Chrome both seem to combine the two `<tr>` elements into a single row, you're right, they must be using some similar processing that we are missing. – xtratic Apr 16 '18 at 18:30
  • If you need to handle the weird cases like this then I think that, rather than try to figure out the weird cases and add them into your own parser, it might be easiest to scrap your own approach and just write the exact table processor as the html spec lays it out since that would handle every weird case. I have not been able to find a library that does this. – xtratic Apr 16 '18 at 19:25
  • What I'm saying is that by following the specification literally one doesn't get a parser like the ones used by Firefox or Chrome, because the algorithm for rows (algorithm for processing rows, step 16) says to begin a new row when the last td is reached, whereas I posted an example of two TRs that real browsers combine. – cdarwin Apr 17 '18 at 15:24
  • I can almost guarantee you that the browsers use this algorithm, else they would not be adhering to the HTML spec. I'm updating my answer to show where in the algorithm this one row is becoming part of other rows: it's because of the `rowspan` attribute causing it to fill more cells below it, thus causing that data to be part of the same rows as the `<td>`s of the rows after it. – xtratic Apr 17 '18 at 16:05
  • @cdarwin Edited. If this answers all your questions then please accept this answer. If you have further questions then I'll be happy to help. – xtratic Apr 17 '18 at 16:50
  • I agree with you that, after parsing, the table read is that drawn by you, and so it seems to me that the algorithm is incomplete. What is the condition to hide the first line of the resulting table? I was thinking about removing incomplete rows, but this doesn't work with the extra example you added. In the meanwhile, thank you for your help. – cdarwin Apr 17 '18 at 17:30
  • I highly doubt that this algorithm is incomplete. The row still exists in Chrome and Firefox but it's 0px in height since (I believe by default) the row height is fitting the data, which in this case allows the first row to be 0px high since the data in the first cell can fill the other 3 cells of the column that it "rowspan"s into. I'll update my answer again to clarify more but please accept this answer, this is getting kinda long and SO is prompting me to move this conversation to chat. – xtratic Apr 17 '18 at 18:26
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/169188/discussion-between-xtratic-and-cdarwin). – xtratic Apr 17 '18 at 18:47