How does the Evernote Web Clipper parse webpages so well?

Asked Feb 11 '13 at 22:10

I have been trying to replicate the parsing capabilities of the Evernote Web Clipper in Python for my own web-scraping projects. I'm only interested in extracting the main body of text from a page, without navigation, ads, sidebars, or comments.

I've used the Python port of Arc90's Readability:

https://github.com/buriy/python-readability

in combination with aaronsw's wonderful html2text library:

https://github.com/aaronsw/html2text

This gives good results most of the time, but Evernote is still noticeably better at isolating the main body of text.
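
To make the problem concrete: as far as I understand it, Readability-style extractors score blocks of a page using signals such as text length and link density, and keep the blocks that look like prose. Below is a deliberately crude, stdlib-only sketch of that idea. It is not Evernote's method (which I don't know) and it is not the actual python-readability algorithm; the names `ParagraphCollector` and `main_text`, and the thresholds `min_chars` and `max_link_density`, are made up for illustration.

```python
from html.parser import HTMLParser

# Containers whose contents are assumed to be non-article chrome.
SKIP_TAGS = {"script", "style", "nav", "header", "footer", "aside"}


class ParagraphCollector(HTMLParser):
    """Collect the text of each <p>, plus how many characters sit inside links."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.paragraphs = []   # list of (text, link_chars) tuples
        self._skip_depth = 0   # > 0 while inside any SKIP_TAGS element
        self._link_depth = 0   # > 0 while inside an <a>
        self._buf = None       # list of text chunks, or None outside a <p>
        self._link_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1
        elif tag == "p" and self._skip_depth == 0:
            self.flush()       # an unclosed <p> ends where the next one starts
            self._buf = []
            self._link_chars = 0
        elif tag == "a":
            self._link_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif tag == "p":
            self.flush()
        elif tag == "a":
            self._link_depth = max(0, self._link_depth - 1)

    def handle_data(self, data):
        if self._buf is None or self._skip_depth:
            return
        self._buf.append(data)
        if self._link_depth:
            self._link_chars += len(data)

    def flush(self):
        """Store the paragraph collected so far, if any, with whitespace collapsed."""
        if self._buf is not None:
            text = " ".join("".join(self._buf).split())
            if text:
                self.paragraphs.append((text, self._link_chars))
        self._buf = None


def main_text(html, min_chars=80, max_link_density=0.3):
    """Keep paragraphs that are long enough and not dominated by link text."""
    collector = ParagraphCollector()
    collector.feed(html)
    collector.close()
    collector.flush()          # pick up a trailing unclosed <p>
    kept = []
    for text, link_chars in collector.paragraphs:
        link_density = min(1.0, link_chars / len(text))
        if len(text) >= min_chars and link_density <= max_link_density:
            kept.append(text)
    return "\n\n".join(kept)
```

Calling `main_text(html)` on a fetched page returns the surviving paragraphs joined by blank lines. A heuristic this simple will miss content that isn't wrapped in `<p>` tags and will drop short but legitimate paragraphs, which is roughly the kind of gap I'm seeing relative to Evernote.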

Could someone please recommend a better approach, or explain what Evernote is doing?

Thanks!

edited Feb 11 '13 at 22:29

vgoklani


I'm sorry, but I think your question is too vague and overly broad to be answered here on SO; see the [FAQ#dontask]. If you have more concrete problems (preferably involving some code), feel free to ask those! – Martijn Pieters Feb 11 '13 at 22:31

I don't agree that "it's overly vague". I am asking for an approach that one would normally use to scrape webpages and get results comparable to Evernote. To me, that question is very specific. – vgoklani Feb 12 '13 at 00:11
@Vishal We don't even know what "results comparable to Evernote" means. You need to give more specific requirements. – wRAR Feb 12 '13 at 00:16

Check out: http://stackoverflow.com/a/24860961/88597 – ohho Oct 12 '15 at 06:23

0 Answers