I have been trying to replicate the parsing capabilities of the Evernote Web Clipper in Python for my own web-scraping projects. I only want to extract the main body of text from a page, nothing else.
So far I've used the Python port of Arc90's Readability:
https://github.com/buriy/python-readability
in combination with aaronsw's wonderful html2text library:
https://github.com/aaronsw/html2text
This gives good results most of the time, but Evernote is still much better at isolating the main body of text.
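For reference, the combination described above can be sketched roughly as follows. This is a minimal sketch rather than my exact code: the function name `extract_main_text` and its two optional parameters are hypothetical, added only so each step can be swapped out or tested separately, and it assumes the two packages are installed and import as `readability` and `html2text`.

```python
def extract_main_text(html, extract_html=None, to_text=None):
    """Return the main article text of an HTML page as plain text.

    extract_html and to_text are hypothetical hooks for illustration;
    by default they fall back to python-readability and html2text.
    """
    if extract_html is None:
        # Assumes python-readability is installed and importable as `readability`.
        from readability import Document

        def extract_html(page):
            # summary() returns the cleaned-up article HTML.
            return Document(page).summary()

    if to_text is None:
        # Assumes html2text is installed; html2text() converts HTML to Markdown-style text.
        import html2text
        to_text = html2text.html2text

    # Step 1: isolate the main content. Step 2: convert it to text.
    article_html = extract_html(html)
    return to_text(article_html).strip()
```

Calling `extract_main_text(page_source)` on the raw HTML of a page then yields the extracted body as Markdown-flavoured text, which is the output I'm comparing against Evernote's.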
Could someone please recommend a better approach, or perhaps explain what Evernote is doing differently?
Thanks!