0

I'm making a news reading application. The best site I found was http://fulltextrssfeed.com/. It takes the text and images from any webpage and gives back clean text. As they don't have an API, I need some way to get the data from the <div>. This is the div ID:

<div id="preview">

How can I leech onto the feed and get only its content? (It would be a plus if there were no HTML tags; if there are, I can work around them.)

Cœur
Allison
  • possible duplicate of [Is there a library for extracting data from an HTML page?](http://stackoverflow.com/questions/8972013/is-there-a-library-for-extracting-data-from-an-html-page) – jscs May 27 '12 at 17:16
  • That's in `C++`; this is Objective-C – Allison May 27 '12 at 20:22
  • Beware: you might infringe their copyright. Web scraping should be done very cautiously. – geekay Feb 17 '13 at 15:44

2 Answers

1

I'm not sure I fully understand your question, but if you're using Objective-C, I really recommend Hpple. It's a really good XML/HTML parser.

To use it, add ${SDKROOT}/usr/include/libxml2 to "Header Search Paths" in your project's build settings, and add -lxml2 to "Other Linker Flags".
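If you keep build settings in a configuration file rather than the Xcode UI, the same two settings could be expressed roughly as below. This is only a sketch: the file name is hypothetical, and `$(inherited)` is added on the assumption that you want to keep any values already set at the project level.

```xcconfig
// Hypothetical Project.xcconfig, mirroring the two build settings named above
HEADER_SEARCH_PATHS = $(inherited) $(SDKROOT)/usr/include/libxml2
OTHER_LDFLAGS = $(inherited) -lxml2
```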

Then, once you have the Hpple files, drag them into your project: TFHpple.h, TFHpple.m, TFHppleElement.h, TFHppleElement.m, XPathQuery.h, XPathQuery.m.

In your code, to get the "preview" div, add:

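#import "TFHpple.h"

// Fetch the page as raw bytes (replace the URL with the page you want to parse).
NSURL *url = [NSURL URLWithString:@"http://www.yoursite.com/index.html"];
NSData *htmlData = [NSData dataWithContentsOfURL:url];

TFHpple *xpathParser = [[TFHpple alloc] initWithHTMLData:htmlData];
NSArray *elements = [xpathParser searchWithXPathQuery:@"//div[@id='preview']"]; // Select the div whose id is "preview"
if ([elements count] > 0) { // Avoid an out-of-range exception when nothing matches
    TFHppleElement *element = [elements objectAtIndex:0];
    NSString *string = [element content];
    NSLog(@"%@", string);
}

[xpathParser release]; // Manual reference counting only; omit this under ARC
// htmlData is autoreleased, so it must not be released here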

Now we have the "preview" div via Hpple. To get child elements (such as p or a), extend the query like this:

// Text nodes of every <p> that is a direct child of the "preview" div
elements = [xpathParser searchWithXPathQuery:@"//div[@id='preview']/p/text()"];

To understand more, take a look at XPath Syntax. Also check out a tutorial.

Hope it helps.

0

I use this to strip all HTML very successfully:

NSString + Strip HTML
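For readers who can't reach the linked category, here is a minimal sketch of one common way to strip tags using `NSScanner`. It is not the linked code: `StripHTMLTags` is a hypothetical helper name, and the sketch assumes simple markup. It does not decode entities such as `&amp;`, and it treats every `<` as the start of a tag.

```objective-c
#import <Foundation/Foundation.h>

// Hypothetical helper (not the linked category): drops anything that looks
// like an HTML tag and trims surrounding whitespace.
static NSString *StripHTMLTags(NSString *html) {
    NSScanner *scanner = [NSScanner scannerWithString:html];
    // NSScanner skips whitespace by default; turn that off so spacing inside the text survives.
    [scanner setCharactersToBeSkipped:nil];

    NSMutableString *result = [NSMutableString string];
    while (![scanner isAtEnd]) {
        NSString *text = nil;
        // Copy everything up to the next '<'.
        if ([scanner scanUpToString:@"<" intoString:&text]) {
            [result appendString:text];
        }
        // Skip the tag itself: '<', its contents, and the closing '>'.
        if ([scanner scanString:@"<" intoString:NULL]) {
            [scanner scanUpToString:@">" intoString:NULL];
            [scanner scanString:@">" intoString:NULL];
        }
    }
    return [result stringByTrimmingCharactersInSet:
            [NSCharacterSet whitespaceAndNewlineCharacterSet]];
}
```

You could pass the string you pulled out of the "preview" div to this function before displaying it. For anything beyond simple markup, a real parser is likely the safer choice.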

Nick