
I have been trying to replicate the parsing capabilities of the Evernote Web Clipper in Python for my own web-scraping projects. I'm interested in extracting only the main body of text, nothing else.

I've used the Python port of Arc90's Readability:

https://github.com/buriy/python-readability

in combination with aaronsw's wonderful html2text library:

https://github.com/aaronsw/html2text

and this gives good results most of the time, but Evernote is just much better at scraping the main body of text.
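
To make "main body of text only" concrete, here is a minimal standard-library sketch of the kind of two-step idea involved: pick the container holding the most paragraph text, then flatten it to plain text. It is only an illustration under loud assumptions. It is not how python-readability, html2text, or Evernote work internally, and the names `ParagraphCollector` and `extract_main_text`, the tag sets, and the "most paragraph characters wins" scoring rule are all hypothetical choices made for this example.

```python
from html.parser import HTMLParser

# Assumed tag sets for this sketch; real extractors use richer signals.
CONTAINERS = {"article", "div", "main", "section", "td"}
IGNORED = {"script", "style", "nav", "header", "footer", "aside"}


class ParagraphCollector(HTMLParser):
    """Group <p> text by the nearest enclosing container element."""

    def __init__(self):
        super().__init__()
        self.stack = []         # ids of the containers currently open
        self.blocks = {}        # container id -> list of paragraph strings
        self.next_id = 0
        self.ignored_depth = 0  # > 0 while inside nav, script, etc.
        self.in_p = False
        self.buf = []

    def _flush(self):
        # Close the current paragraph and credit it to its container.
        if not self.in_p:
            return
        self.in_p = False
        text = " ".join("".join(self.buf).split())  # collapse whitespace
        if text:
            owner = self.stack[-1] if self.stack else -1
            self.blocks.setdefault(owner, []).append(text)

    def handle_starttag(self, tag, attrs):
        if tag in IGNORED:
            self.ignored_depth += 1
        elif tag in CONTAINERS:
            self._flush()
            self.stack.append(self.next_id)
            self.next_id += 1
        elif tag == "p":
            self._flush()  # tolerate an unclosed previous <p>
            self.in_p = True
            self.buf = []

    def handle_endtag(self, tag):
        if tag in IGNORED:
            if self.ignored_depth:
                self.ignored_depth -= 1
        elif tag in CONTAINERS:
            self._flush()
            if self.stack:
                self.stack.pop()
        elif tag == "p":
            self._flush()

    def handle_data(self, data):
        # Keep text only inside <p> and outside ignored regions.
        if self.in_p and not self.ignored_depth:
            self.buf.append(data)


def extract_main_text(html):
    """Return the paragraphs of the container with the most paragraph text."""
    parser = ParagraphCollector()
    parser.feed(html)
    parser.close()
    if not parser.blocks:
        return ""
    best = max(parser.blocks.values(), key=lambda ps: sum(len(p) for p in ps))
    return "\n\n".join(best)
```

Calling `extract_main_text(page_html)` on a page with a short sidebar and a longer article container should return only the article paragraphs, separated by blank lines. A heuristic this crude will fail on pages that split the article across several containers, which is part of why a stronger approach is being asked for.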

Could someone please recommend a better approach, or perhaps tell me what Evernote is doing?

Thanks!

vgoklani
  • I'm sorry, but I think your question is too vague and overly broad to be answered here on SO; see the [FAQ#dontask]. If you have more concrete problems (preferably involving some code), feel free to ask those! – Martijn Pieters Feb 11 '13 at 22:31
  • I don't agree that "it's overly vague". I am asking for an approach that one would normally use to scrape webpages and get results comparable to Evernote. To me, that question is very specific. – vgoklani Feb 12 '13 at 00:11
  • @Vishal We don't even know what is "results comparable to Evernote". You need to give more specific requirements. – wRAR Feb 12 '13 at 00:16
  • Check out: http://stackoverflow.com/a/24860961/88597 – ohho Oct 12 '15 at 06:23

0 Answers