I'm currently building an Instapaper clone and need some help designing the algorithm.

It has two components:

  1. Extract the main text block from an HTML document
  2. If the saved article spans more than one page, extract the text from all pages

Can you guys point me in the right direction? I will be using C# on .NET 4 for this project.

Jason
  • This is kind of like saying "I'd like to build a compiler. It has two components: a thing that reads the code, and the code generator. Can you guys point me in the right direction?" – Foredecker Dec 28 '10 at 19:51
  • Not asking you to do it for me. Just want some recommendations. – Jason Dec 28 '10 at 20:01
  • What do you mean by question #2? Typically HTML doesn't have the concept of multiple pages unless the document is being printed or the developer built in a method of serving the full document in chunks. – Mike Chess Dec 28 '10 at 20:17

1 Answer

  1. Use Html Agility Pack to extract the stuff you need from the HTML document.
  2. Same as #1.

I suppose that doesn't provide you with much direction, but you didn't provide me with much direction, either.

Brian
  • How can you detect if the article is on multiple pages? – Jason Dec 28 '10 at 20:10
  • @Jason: I don't understand that question. Do you mean how can you detect if an article has multiple pages (e.g. [Hidden Features of .net](http://stackoverflow.com/questions/9033/hidden-features-of-c) has 11 pages of answers)? The easiest way is to look for links whose text or alt text is a number or the word "next", and to search for `rel="next"` in the `a` tags. Be careful with this, though, since some sites are made up of hundreds of pages (e.g., blogs or webcomics). – Brian Dec 28 '10 at 20:14
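
The "next" link heuristic from the comment above can be sketched roughly as follows. This is a minimal, language-neutral illustration in Python using only the standard library, not Html Agility Pack code. The names `find_next_link`, `collect_pages`, `fetch`, and `max_pages` are hypothetical, chosen for this example. It checks `rel="next"` first, then falls back to link text or image alt text containing the word "next". It does not handle numbered page links, and it caps the crawl because some sites chain hundreds of pages.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin


class _LinkCollector(HTMLParser):
    """Collects href, rel tokens, and visible/alt text for each <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []       # finished links
        self._current = None  # link currently being read, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            self._current = {
                "href": attrs["href"],
                "rel": (attrs.get("rel") or "").lower().split(),
                "text": "",
            }
        elif tag == "img" and self._current is not None:
            # Image links often carry their label in the alt text.
            self._current["text"] += " " + (attrs.get("alt") or "")

    def handle_data(self, data):
        if self._current is not None:
            self._current["text"] += data

    def handle_endtag(self, tag):
        if tag == "a" and self._current is not None:
            self.links.append(self._current)
            self._current = None


def find_next_link(html, base_url):
    """Return the absolute URL of the likely next page, or None."""
    parser = _LinkCollector()
    parser.feed(html)
    parser.close()
    # Strongest signal first: an explicit rel="next".
    for link in parser.links:
        if "next" in link["rel"]:
            return urljoin(base_url, link["href"])
    # Fallback: link text or alt text containing the word "next".
    for link in parser.links:
        if re.search(r"\bnext\b", link["text"], re.IGNORECASE):
            return urljoin(base_url, link["href"])
    return None


def collect_pages(fetch, start_url, max_pages=20):
    """Follow next links from start_url; fetch(url) must return HTML text.

    Stops at max_pages (an arbitrary cap) or when a URL repeats.
    """
    pages, seen, url = [], set(), start_url
    while url and url not in seen and len(pages) < max_pages:
        seen.add(url)
        html = fetch(url)
        pages.append(html)
        url = find_next_link(html, url)
    return pages
```

Passing `fetch` in as a parameter keeps the downloading separate from the link detection, so the logic can be tried against saved HTML first. The same checks should carry over to Html Agility Pack in C#, and the page cap is worth tuning per site.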