I'm trying to figure out why html2text is breaking my HTML:
<div><table> <tbody> <tr> <td> <span><strong><a href="/pages/about_paul_221673.cfm"><span>About</span></a></strong></span></td> <td> <span><strong><span>•</span></strong></span></td> <td> <span><strong><a href="/pages/contact_us_222511.cfm"><span>Contact</span></a></strong></span></td> <td> <span><strong><span>•</span></strong></span></td> <td> <span><strong><a><span>Maths Games Order</span></a></strong></span></td> <td> <span><strong><span>•</span></strong></span></td> <td> <span><strong><a href="/pages/faqs_222510.cfm"><span>FAQ</span></a></strong></span></td> </tr> </tbody> </table>s<div> <span><strong>Broadbent Maths Ltd<br> 3 High Street, Welbourn, Lincoln, LN5 0NH </strong></span></div> </div>
Processing it with:
cat "/home/spider/original-file.txt" | html2text -utf8 -nobs -style pretty
When I run that, I get:
Input recoding failed due to invalid input sequence. Unconverted part of text follows. ▒Contact ▒Maths Games Order ▒FAQ
s Broadbent Maths Ltd 3 High Street, Welbourn, Lincoln, LN5 0NH
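Since the error mentions an invalid input sequence, one check that might narrow this down is whether the file on disk decodes cleanly as UTF-8 before html2text ever sees it. This is only a rough sketch: it assumes iconv is installed, and is_valid_utf8 is a helper name made up for illustration, not part of html2text.

```shell
#!/bin/sh
# Hypothetical helper (not part of html2text): returns success only if
# iconv can decode the whole file as UTF-8 without hitting an invalid sequence.
is_valid_utf8() {
    iconv -f UTF-8 -t UTF-8 "$1" > /dev/null 2>&1
}

# Usage with the file from the command above (only runs if the file exists):
f="/home/spider/original-file.txt"
if [ -f "$f" ]; then
    if is_valid_utf8 "$f"; then
        echo "file is valid UTF-8"
    else
        echo "file contains bytes that are not valid UTF-8"
    fi
fi
```

If the file passes this check, the bytes on disk are probably fine and the problem is more likely in how html2text decides the input encoding.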
When I run Devel::Peek::Dump() (Perl) on the string, I see:
SV = PV(0x564c0a72c860) at 0x564c09967c80
REFCNT = 1
FLAGS = (POK,IsCOW,pPOK,UTF8)
PV = 0x564c0a58bc60 "\n<div><table> <tbody> <tr> <td> <span><strong><a href=\"/pages/about_paul_221673.cfm\"><span>About</span></a></strong></span></td> <td> <span><strong><span>•</span></strong></span></td> <td> <span><strong><a href=\"/pages/contact_us_222511.cfm\"><span>Contact</span></a></strong></span></td> <td> <span><strong><span>•</span></strong></span></td> <td> <span><strong><a><span>Maths Games Order</span></a></strong></span></td> <td> <span><strong><span>•</span></strong></span></td> <td> <span><strong><a href=\"/pages/faqs_222510.cfm\"><span>FAQ</span></a></strong></span></td> </tr> </tbody> </table>s<div> <span><strong>Broadbent Maths Ltd<br> 3 High Street, Welbourn, Lincoln, LN5 0NH </strong></span></div> </div>\n"\0 [UTF8 "\n<div><table> <tbody> <tr> <td> <span><strong><a href="/pages/about_paul_221673.cfm"><span>About</span></a></strong></span></td> <td> <span><strong><span>•</span></strong></span></td> <td> <span><strong><a href="/pages/contact_us_222511.cfm"><span>Contact</span></a></strong></span></td> <td> <span><strong><span>•</span></strong></span></td> <td> <span><strong><a><span>Maths Games Order</span></a></strong></span></td> <td> <span><strong><span>•</span></strong></span></td> <td> <span><strong><a href="/pages/faqs_222510.cfm"><span>FAQ</span></a></strong></span></td> </tr> </tbody> </table>s<div> <span><strong>Broadbent Maths Ltd<br> 3 High Street, Welbourn, Lincoln, LN5 0NH </strong></span></div> </div>\n"]
CUR = 725
LEN = 736
COW_REFCNT = 1
If I remove the first bit:
<div><table>
It works fine! I don't get why it's breaking there, though. The markup all looks OK to me. What is going wrong?