I am looking for a way to cleanly convert HTML tables to readable plain text.
For example, given the input:
<table>
<tr>
<td>Height:</td>
<td>200</td>
</tr>
<tr>
<td>Width:</td>
<td>440</td>
</tr>
</table>
I expect the output:
Height: 200
Width: 440
I would prefer not to use external tools such as w3m -dump file.html, because (1) they are platform-dependent, (2) I want some control over the process, and (3) I assume this is doable with Python alone, with or without extra modules.
I don't need any word-wrapping or adjustable cell separator width. Having tabs as cell separators would be good enough.
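To make the requirements above concrete, here is a minimal sketch of what I have in mind, using only the standard library's html.parser. The names TableToText and table_to_text are hypothetical, and the sketch assumes reasonably well-formed, non-nested tables where every cell has a closing tag; it is an illustration of the desired behaviour, not a complete solution.

```python
from html.parser import HTMLParser


class TableToText(HTMLParser):
    """Collect table cells row by row (hypothetical helper for illustration)."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row currently being parsed
        self._cell = None  # text fragments of the cell currently being parsed

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only text inside an open cell is kept; whitespace between tags is ignored.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Collapse runs of whitespace inside the cell to single spaces.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def table_to_text(html, sep="\t"):
    """Return one line per table row, with cells joined by `sep` (tab by default)."""
    parser = TableToText()
    parser.feed(html)
    parser.close()
    return "\n".join(sep.join(row) for row in parser.rows)


sample = """<table>
<tr><td>Height:</td><td>200</td></tr>
<tr><td>Width:</td><td>440</td></tr>
</table>"""
print(table_to_text(sample))
```

Passing sep=" " instead of the default tab would give exactly the space-separated output shown above.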
Update
This was an old question for an old use case. Given that pandas provides the read_html function, my current answer would definitely be pandas-based.