I want to extract the body paragraphs from a web page and store them into a string.

First, I obtain the entire source code using

NSError *error = nil;
NSString *sourceCode = [NSString stringWithContentsOfURL:[NSURL URLWithString:currentLink] encoding:NSUTF8StringEncoding error:&error];

The body paragraphs begin after <!-- (START) Pagination Content Wrapper --> and end before <!-- (END) Pagination Content Wrapper -->,

so I plan to split the string like so:

NSString *startingPt = @"<!-- (START) Pagination Content Wrapper -->";
NSString *endingPt = @"<!-- (END) Pagination Content Wrapper -->";

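// Keep everything after the start marker...
NSString *sub = [sourceCode substringFromIndex:NSMaxRange([sourceCode rangeOfString:startingPt])];
// ...then cut the result off where the end marker begins.
sub = [sub substringToIndex:[sub rangeOfString:endingPt].location];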
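One caveat with this kind of slicing: rangeOfString: returns a range whose location is NSNotFound when the marker is missing, and passing that on to the substring methods raises an exception. A minimal sketch of a guarded version follows; ContentBetweenMarkers is a hypothetical helper name, not a Foundation API, and it assumes both markers are non-nil.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: returns the text between startMarker and endMarker,
// or nil if the source is nil or either marker cannot be found.
static NSString *ContentBetweenMarkers(NSString *source,
                                       NSString *startMarker,
                                       NSString *endMarker) {
    if (source == nil) {
        return nil;
    }
    NSRange startRange = [source rangeOfString:startMarker];
    if (startRange.location == NSNotFound) {
        return nil;
    }
    NSUInteger contentStart = NSMaxRange(startRange);

    // Look for the end marker only in the text after the start marker.
    NSRange searchRange = NSMakeRange(contentStart, source.length - contentStart);
    NSRange endRange = [source rangeOfString:endMarker options:0 range:searchRange];
    if (endRange.location == NSNotFound) {
        return nil;
    }
    return [source substringWithRange:
            NSMakeRange(contentStart, endRange.location - contentStart)];
}
```

Calling ContentBetweenMarkers(sourceCode, startingPt, endingPt) would then give either the wrapped content or nil, which is easier to check than catching a range exception.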

Then I would use stringByReplacingOccurrencesOfString:withString: to replace the remaining HTML tags with @"".

Is there a better way to achieve my goal?

Mahir

2 Answers

After extracting the substring between the START and END markers, you can simply use the NSString+HTML category to deal with the HTML tags. It is a very good category for HTML encoding, decoding, and more, and its main advantage is that its methods work directly on your NSString instances, so there is no need to create separate objects for the purpose.

You can find more discussion of it in the question "Objective C HTML escape/unescape".

These are the methods it makes available, as suggested in that post, and I like it.

- (NSString *)stringByConvertingHTMLToPlainText;
- (NSString *)stringByDecodingHTMLEntities;
- (NSString *)stringByEncodingHTMLEntities;
- (NSString *)stringWithNewLinesAsBRs;
- (NSString *)stringByRemovingNewLinesAndWhitespace;
vishy

You're going to have to find the HTML tags before you remove them. Unless you know for a fact that there is a limited set of tags that this system will ever need to handle, you shouldn't hard-code a list of them in your code. And with -stringByReplacingOccurrencesOfString:withString:, you need an exact string, including every attribute (id, class, etc.), which makes the approach even more fragile when the markup changes.

Unless you're going to use the third-party extension suggested by vishy, which looks like it does what you need, you're going to have to do something like this:

1) Find the first occurrence of "<" in the string

2) See if the "<" is escaped.

3) If not, find the next ">".

4) See if that is escaped.

5) If not, create an NSRange for the tag (from "<" to ">") and use -stringByReplacingCharactersInRange:withString: to get rid of it.

6) Repeat until you don't find any more unescaped "<".
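A rough sketch of the loop above, using NSScanner in place of repeated -stringByReplacingCharactersInRange:withString: calls. StripTags is a hypothetical helper name, and the sketch makes simplifying assumptions: it skips the escape checks in steps 2 and 4, treats every "<" as the start of a tag, and does not handle a ">" inside an attribute value or the contents of script and style elements.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: drops everything from each "<" through the next ">"
// and keeps the text in between. Returns nil for nil input.
static NSString *StripTags(NSString *html) {
    if (html == nil) {
        return nil;
    }
    NSMutableString *result = [NSMutableString string];
    NSScanner *scanner = [NSScanner scannerWithString:html];
    // By default NSScanner skips whitespace; turn that off so spacing is preserved.
    scanner.charactersToBeSkipped = nil;

    while (![scanner isAtEnd]) {
        NSString *text = nil;
        // Copy the text that precedes the next "<" (if there is any).
        if ([scanner scanUpToString:@"<" intoString:&text]) {
            [result appendString:text];
        }
        if ([scanner isAtEnd]) {
            break;
        }
        // Skip the tag itself: everything up to and including the next ">".
        [scanner scanUpToString:@">" intoString:NULL];
        [scanner scanString:@">" intoString:NULL];
    }
    return result;
}
```

Building a new string in one pass avoids recomputing ranges after each removal, which the replace-in-place version would have to do.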

This will leave you with de-HTMLified text, but NOT plain text. You will still see HTML escapes, and just as importantly, there is no guarantee that the whitespace (which is ignored in HTML) will make any sense once the HTML is removed.
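For the whitespace part, one possible approach is to collapse every run of whitespace into a single space and then trim the ends. This is only a sketch: CollapseWhitespace is a hypothetical helper name, and it flattens paragraph breaks too, so it assumes you want one continuous string. It does nothing about the remaining HTML escapes.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: replaces each run of spaces, tabs, and newlines with a
// single space, then trims leading and trailing whitespace.
static NSString *CollapseWhitespace(NSString *text) {
    if (text == nil) {
        return nil;
    }
    // NSRegularExpressionSearch makes the search string a regex pattern.
    NSString *collapsed =
        [text stringByReplacingOccurrencesOfString:@"\\s+"
                                        withString:@" "
                                           options:NSRegularExpressionSearch
                                             range:NSMakeRange(0, text.length)];
    return [collapsed stringByTrimmingCharactersInSet:
            [NSCharacterSet whitespaceAndNewlineCharacterSet]];
}
```

To keep paragraph boundaries instead, the same call could be run with a pattern that matches only spaces and tabs, leaving newlines to be handled separately.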

chapka
  • There are random amounts of whitespace, as you mentioned. Is there no way of getting rid of it? – Mahir Oct 23 '12 at 07:46
  • Use [myString stringByTrimmingCharactersInSet: [NSCharacterSet whitespaceAndNewlineCharacterSet]]; – chapka Oct 27 '12 at 01:31
  • I had already added that. It only gets rid of the spaces before the beginning of the text and after, but not the ones between paragraphs – Mahir Oct 27 '12 at 01:58