I need to parse an HTML table containing colspans and rowspans and build a representation of it.
Reading the HTML is not a problem; I'm using HTMLCleaner and XQuery with Saxon (Java).
What I'm looking for is a good algorithm to build the table, because I don't understand the rules that browsers follow in "difficult" cases.
For example, given the following table (where the rowspan is wrong):
<table border=1>
<tr><td rowspan="3">1</td><td>2</td></tr>
<tr><td>3</td></tr>
</table>
I apply the following algorithm:
1) for each tr
1.1) expand the colspan and rowspan of the cells in the current line
1.2) create a new line if it doesn't already exist
1.3) for each td add the elements to the line
i.e. (E is an empty cell)
newline->no line existing==no expansion
add line elements (1.3)
line1: 1[tr=3], 2
newline->tr expansion (1.1)
line1: 1[tr=3], 2
line2: E
line3: E
add line elements (1.3)
line1: 1[tr=3], 2
line2: E, 3
line3: E
line3 has to be removed, because Firefox renders only two lines. How can I detect that?
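To make the procedure concrete, here is a minimal Java sketch of the steps traced above. The names `NaiveTableExpander`, `Cell` and `build` are made up for illustration (they are not part of HTMLCleaner or Saxon), and the expansion of step 1.1 is done eagerly when a cell is added rather than at the start of the next tr, which gives the same lines for this example. It is only a sketch of my current approach, not a correct table model: overlapping spans simply overwrite each other.

```java
import java.util.ArrayList;
import java.util.List;

class NaiveTableExpander {
    /** Hypothetical holder for one td: its text plus its span attributes. */
    static final class Cell {
        final String text;
        final int colspan;
        final int rowspan;
        Cell(String text, int colspan, int rowspan) {
            this.text = text;
            this.colspan = colspan;
            this.rowspan = rowspan;
        }
    }

    static final String E = "E"; // placeholder for a slot covered by a span

    static List<List<String>> build(List<List<Cell>> trs) {
        List<List<String>> lines = new ArrayList<>();
        for (int y = 0; y < trs.size(); y++) {
            while (lines.size() <= y) lines.add(new ArrayList<>()); // step 1.2
            int x = 0;
            for (Cell c : trs.get(y)) {                              // step 1.3
                // skip slots already filled by an expansion from an earlier line
                while (get(lines, x, y) != null) x++;
                // step 1.1 (eager): the cell itself, then E for every covered slot
                for (int dy = 0; dy < c.rowspan; dy++)
                    for (int dx = 0; dx < c.colspan; dx++)
                        set(lines, x + dx, y + dy, dx == 0 && dy == 0 ? c.text : E);
                x += c.colspan;
            }
        }
        return lines;
    }

    private static String get(List<List<String>> lines, int x, int y) {
        if (y >= lines.size() || x >= lines.get(y).size()) return null;
        return lines.get(y).get(x);
    }

    private static void set(List<List<String>> lines, int x, int y, String v) {
        while (lines.size() <= y) lines.add(new ArrayList<>()); // creates lines beyond the last tr
        List<String> line = lines.get(y);
        while (line.size() <= x) line.add(null);
        line.set(x, v);
    }

    public static void main(String[] args) {
        List<List<Cell>> trs = List.of(
            List.of(new Cell("1", 1, 3), new Cell("2", 1, 1)),
            List.of(new Cell("3", 1, 1)));
        System.out.println(build(trs)); // prints [[1, 2], [E, 3], [E]]
    }
}
```

The third list in the output is the extra line that Firefox does not render.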
I'm particularly interested in cases where the elements of an incomplete line are completed with those of the following one, like:
<tr><td>1</td><td>2</td><td>3</td></tr>
<tr><td>4</td><td>5</td></tr>
<tr><td>6</td></tr>
rendering: 1 2 3
4 5 6
I have a practical case: this file contains two TRs that are rendered as a single row even though they are separate TR elements. Why?
The lines are these (starting from line 129792)
they are rendered as (inside the red rectangle)
How can I decide when to enqueue elements to the previous line?
What rules do browsers follow for weird code?
I'm using Java, and I also understand JavaScript and a little PHP, but I'm mainly interested in the algorithm to follow. I'd like to know whether something already exists, or to hear any suggestions.
What I want is to be able to output a text representation of the table like one rendered by a real browser.
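As an illustration of the kind of output I mean, here is a rough Java sketch that prints a grid of strings as aligned text. `TextTableRenderer`, `render` and the ` | ` separator are my own placeholders, and the sketch assumes the grid has already been built (one list of strings per line, spans already expanded), which is exactly the part I don't know how to do correctly.

```java
import java.util.List;

class TextTableRenderer {
    /** Renders each line on its own text row, padding every column to its widest entry. */
    static String render(List<List<String>> lines) {
        int cols = 0;
        for (List<String> line : lines) cols = Math.max(cols, line.size());

        int[] width = new int[cols];
        for (List<String> line : lines)
            for (int x = 0; x < line.size(); x++)
                width[x] = Math.max(width[x], text(line.get(x)).length());

        StringBuilder sb = new StringBuilder();
        for (List<String> line : lines) {
            for (int x = 0; x < cols; x++) {
                // lines shorter than the widest one are padded with empty slots
                String s = x < line.size() ? text(line.get(x)) : "";
                if (x > 0) sb.append(" | ");
                sb.append(String.format("%-" + Math.max(1, width[x]) + "s", s));
            }
            sb.append('\n');
        }
        return sb.toString();
    }

    private static String text(String s) {
        return s == null ? "" : s; // treat missing slots as empty
    }

    public static void main(String[] args) {
        System.out.print(render(List.of(List.of("1", "2", "3"), List.of("4", "5", "6"))));
    }
}
```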
Edit:
After reading xtratic's answer, I read the HTML table processing model specification, but it doesn't seem to answer my question about when elements must be enqueued to the previous line, as in the practical case I described (and added in this edit). Indeed, the document says "16 If current cell is the last td or th element child in the tr element being processed, then increase ycurrent by 1, abort this set of steps, and return to the algorithm above.". But in practice the browser does not always start a new line when the last td is found.
What interests me most is when to combine different rows. I tried enqueuing the TDs after those of the previous line when the previous line has fewer TDs than the maximum found so far, but it doesn't work.
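For reference, the heuristic I tried looks roughly like the following Java sketch. `RowMergeHeuristic` and `merge` are names invented for this post, each tr is reduced to a plain list of cell texts, and colspans and rowspans are ignored entirely.

```java
import java.util.ArrayList;
import java.util.List;

class RowMergeHeuristic {
    /** Appends a tr to the previous line whenever that line is shorter than the widest line so far. */
    static List<List<String>> merge(List<List<String>> trs) {
        List<List<String>> lines = new ArrayList<>();
        int maxCells = 0; // largest number of TDs found in any line so far
        for (List<String> tr : trs) {
            List<String> last = lines.isEmpty() ? null : lines.get(lines.size() - 1);
            if (last != null && last.size() < maxCells) {
                last.addAll(tr);                 // enqueue onto the short previous line
            } else {
                last = new ArrayList<>(tr);      // otherwise start a new line
                lines.add(last);
            }
            maxCells = Math.max(maxCells, last.size());
        }
        return lines;
    }

    public static void main(String[] args) {
        System.out.println(merge(List.of(
            List.of("1", "2", "3"), List.of("4", "5"), List.of("6"))));
        // prints [[1, 2, 3], [4, 5, 6]]
    }
}
```

This reproduces the 1 2 3 / 4 5 6 example, but it also merges rows that are legitimately short: the input [[1, 2], [3], [4, 5]] comes out as [[1, 2], [3, 4, 5]].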