Given the NSString below, which was converted from a CDATA section retrieved while parsing an XML document with NSXMLParser, how can I obtain the following properties of the book: title, cover image, author, price, and rating?

Here is my rudimentary plan for obtaining each property:

  1. Book Title - I could probably obtain this by looking at the riRssTitle span class, but then I would have to figure out how to read the text inside the <a href> tag to obtain the title.

  2. Book Image - I would have to grab the first URL, http://ecx.images-amazon.com/images/I/41Lg22K3ViL._SL160_PIsitb-sticker-arrow-dp,TopRight,12,-18_SH30_OU02_.jpg, keep everything up to http://ecx.images-amazon.com/images/I/41Lg22K3ViL, omit the rest, and then append the .jpg extension to get a complete URL for retrieving the image later.

  3. Book Author - I'd follow the same approach as in step 1, but search for the riRssContributor span instead.

  4. Book Price - There is no price tag that's common across all items. The one thing I do see in common is that the price always sits in a <b> tag inside a <font> tag.

  5. Rating - This can probably be retrieved by finding the image URL that contains the word stars and reading the numbers that follow it: 4 means 4 stars, and a trailing -5 means an extra half star, so 3-5 means 3.5 stars.

What's the best way of doing this without it getting messy? I also don't like that my code can break should Amazon decide to change the way it formats its URLs; my app relies on Amazon maintaining its URL naming conventions.

For now though, is this the best way forward? Is there a quick parser that can achieve what I wish to do?

This is an example of an Amazon RSS feed: http://www.amazon.co.uk/gp/rss/bestsellers/books/72/ref=zg_bs_72_rsslink


Here is the CDATA string I retrieve for each item.

<div style="float:left;">
    <a class="url" href="http://www.amazon.co.uk/Gone-Girl-Gillian-Flynn/dp/0753827662/ref=pd_zg_rss_ts_b_72_9">
        <img src="http://ecx.images-amazon.com/images/I/41Lg22K3ViL._SL160_PIsitb-sticker-arrow-dp,TopRight,12,-18_SH30_OU02_.jpg" alt="Gone Girl" border="0" hspace="0" vspace="0" />
    </a>
</div>
<span class="riRssTitle">
    <a href="http://www.amazon.co.uk/Gone-Girl-Gillian-Flynn/dp/0753827662/ref=pd_zg_rss_ts_b_72_9">Gone Girl</a>
</span>
<br />
<span class="riRssContributor">
    <a href="http://www.amazon.co.uk/Gillian-Flynn/e/B001JP3W46/ref=ntt_athr_dp_pel_1">Gillian Flynn</a>
    <span class="byLinePipe">(Author)</span>
</span>
<br />
<img src="http://g-ecx.images-amazon.com/images/G/02/x-locale/common/icons/uparrow_green_trans._V192561975_.gif" width="13" align="abstop" alt="Ranking has gone up in the past 24 hours" title="Ranking has gone up in the past 24 hours" height="11" border="0" />
<font color="green">
    <strong></strong>
</font> 674 days in the top 100 
<br />
<img src="http://g-ecx.images-amazon.com/images/G/02/detail/stars-4-0._V192253865_.gif" width="64" height="12" border="0" style="margin: 0; padding: 0;"/>(5704)
<br />
<br />
<a href="http://www.amazon.co.uk/Gone-Girl-Gillian-Flynn/dp/0753827662/ref=pd_zg_rss_ts_b_72_9">Buy new: </a>
<strike>£9.07</strike>
<font color="#990000">
    <b>£3.85</b>
</font>
<br />
<a href="http://www.amazon.co.uk/gp/offer-listing/0753827662/ref=pd_zg_rss_ts_b_72_9?ie=UTF8&condition=all">60 used & new</a> from 
<span class="price">£2.21</span>
<br />
<br />(Visit the 
<a href="http://www.amazon.co.uk/Best-Sellers-Books-Crime-Thrillers-Mystery/zgbs/books/72/ref=pd_zg_rss_ts_b_72_9">Bestsellers in Crime, Thrillers & Mystery</a> list for authoritative information on this product's current rank.)
Pavan

4 Answers

8

TFHpple is definitely the library to go with for parsing HTML (more than 1,000 stars on GitHub): https://github.com/topfunky/hpple

Here's the Objective-C solution for that RSS feed:

NSString *stringURL = @"http://www.amazon.co.uk/gp/rss/bestsellers/books/72/ref=zg_bs_72_rsslink";
NSURL  *url = [NSURL URLWithString:stringURL];
NSData *htmlData = [NSData dataWithContentsOfURL:url];

TFHpple * doc = [[TFHpple alloc] initWithHTMLData:htmlData];

NSArray *titleElements = [doc searchWithXPathQuery:@"//span[@class='riRssTitle']/a"];
for (TFHppleElement *element in titleElements)
{
    NSString *title = element.firstChild.content;
    NSLog(@"title: %@", title);
}

NSArray *imageElements = [doc searchWithXPathQuery:@"//a[@class='url']/img"];
for (TFHppleElement *element in imageElements)
{
    NSString *image = element.attributes[@"src"];
    NSMutableArray *parts = [[image componentsSeparatedByString:@"/"] mutableCopy];
    NSArray *pathParts = [parts.lastObject componentsSeparatedByString:@"."];
    [parts removeLastObject];
    [parts addObject:[NSString stringWithFormat:@"%@.%@",pathParts.firstObject, pathParts.lastObject]];
    image = [parts componentsJoinedByString:@"/"];
    NSLog(@"image: %@", image);
}

NSArray *authorElements = [doc searchWithXPathQuery:@"//span[@class='riRssContributor']/a"];
for (TFHppleElement *element in authorElements)
{
    NSString *author = element.firstChild.content;
    NSLog(@"author: %@", author);
}

NSArray *priceElements = [doc searchWithXPathQuery:@"//font/b"];
for (TFHppleElement *element in priceElements)
{
    NSString *price = element.firstChild.content;
    NSLog(@"price: %@", price);
}

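// Rating: the stars image URL looks like .../stars-4-0._V192253865_.gif
NSArray *ratingElements = [doc searchWithXPathQuery:@"//img"];
for (TFHppleElement *element in ratingElements)
{
    if (![element.attributes[@"src"] containsString:@"stars"])
        continue;

    // With the g-ecx.images-amazon.com host shown in the question, splitting on "-"
    // puts the whole stars in parts[3] and the half-star digit at the start of parts[4].
    NSArray *parts = [element.attributes[@"src"] componentsSeparatedByString:@"-"];
    if (parts.count < 5) continue;

    NSString *rating = [NSString stringWithFormat:@"%@.%@", parts[3], [parts[4] substringToIndex:1]];
    NSLog(@"rating: %@", rating);
}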

Like you said, you are at the mercy of Amazon's naming conventions.

Krys Jurgowski
  • Why the for loops when the path is provided? I would understand if you were searching through the elements to find something specific, but I see a value being assigned on every iteration of every for loop. What's happening? – Pavan Nov 16 '14 at 17:59
  • Because the link to the RSS feed you provided has multiple items. The for loops print out the property in each item. – Krys Jurgowski Nov 16 '14 at 22:56
  • so, you're simply echoing the items and proposing that as a solution to my question? – Pavan Nov 17 '14 at 13:49
  • There are 10 items in the RSS feed. Each of those for loops echoes one of the 6 properties you wanted to parse out. First for loop - 10 book titles, second for loop - 10 book images, etc. – Krys Jurgowski Nov 17 '14 at 15:50
  • so if you check the live feed you'll notice that the price sometimes jumps places, and that some values return null for the author because they then jump into other tags. Is there no better way? I can't have null values. – Pavan Nov 17 '14 at 20:03
4

You can use TFHpple and TFHppleElement to parse the above data as you need.

Here is the reference for doing this.

Neenu
2

I saw your post on the iOS developer Facebook group and thought I'd give my last-minute input.

Because Amazon doesn't keep a strict naming convention, you have to search through the feed. That is what I do here, while trying to make it look less hackish. You'll notice that the feed sometimes returns missing values if you look for specific path names, so I've tried to cover that case too.

For this to work you simply need to download the NSDictionary category from this URL: https://github.com/nicklockwood/XMLDictionary

.h
#import <Foundation/Foundation.h>

@interface JMAmazonProcessor : NSObject
+(NSArray*)processAmazonResponseWithXMLData:(NSData*)responseObject;
@end

and for

.m

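#import "JMAmazonProcessor.h"
#import "XMLDictionary.h"      // NSDictionary category from the repository linked above
#import "RSSBookEntryModel.h"  // assumed file name for the model class shown below
#import "Constants.h"          // assumed file name for the constants header shown below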

@implementation JMAmazonProcessor



+(NSString*)getBookTitleWithArray:(NSArray*)array{
    return [[array[0] objectForKey:kAmazonAHREFKey] objectForKey:kAmazonUnderscoreTextKey];
}

+(NSString*)getBookAuthorWithArray:(NSArray*)array{
    id bookAuthor = [[array[1] objectForKey:kAmazonAHREFKey] objectForKey:kAmazonUnderscoreTextKey];

    if(!bookAuthor){
        bookAuthor = [array[1] objectForKey:kAmazonUnderscoreTextKey];
    }

    if([bookAuthor isKindOfClass:[NSArray class]]){
        bookAuthor = [bookAuthor componentsJoinedByString:@" "];
    }

    return bookAuthor;
}
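+(NSString*)getPriceFromDictionary:(NSDictionary*)dictionary{
    // XMLDictionary gives an NSArray for several <font> nodes and an NSDictionary for a single one.
    id font = [dictionary objectForKey:@"font"];
    if([font isKindOfClass:[NSArray class]]){
        font = [font lastObject];
    }
    id price = [font isKindOfClass:[NSDictionary class]] ? [font objectForKey:@"b"] : nil;
    // Return nil instead of crashing when the item has no price node.
    return [price isKindOfClass:[NSString class]] ? price : nil;
}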


+(NSString*)getRatingWithCurrentRatingDictionary:(NSDictionary*)ratingDictionary{
    NSString * stars;
    if([ratingDictionary objectForKey:@"_src"]){
        NSString * possibleStarsURL = [ratingDictionary objectForKey:@"_src"];
        if([possibleStarsURL rangeOfString:@"stars-" options:NSCaseInsensitiveSearch].location != NSNotFound){
            stars = [[[[[possibleStarsURL componentsSeparatedByString:@"stars-"] lastObject] componentsSeparatedByString:@"."] firstObject] stringByReplacingOccurrencesOfString:@"-" withString:@"."];
        }
    }


    return stars;

}
+(NSString*)getRatingFromDictionary:(NSDictionary*)dictionary{
    id currentDictionary = [dictionary objectForKey:@"img"];
    NSString *rating;

    if([currentDictionary isKindOfClass:[NSArray class]]){
        for(int i = 0; i < [currentDictionary count]; i++){
            NSDictionary *currentRatingDictionary = [currentDictionary objectAtIndex:i];

            if((rating = [self getRatingWithCurrentRatingDictionary:currentRatingDictionary])){
                break;
            }
        }
    }

    else if([currentDictionary isKindOfClass:[NSDictionary class]]){
        rating = [self getRatingWithCurrentRatingDictionary:currentDictionary];
    }

    if(!rating) rating = @"Rating is not currently available";
    return rating;
}


+(NSArray*)processAmazonResponseWithXMLData:(NSData*)responseObject{
    NSMutableArray *bookEntries = [[NSMutableArray alloc] init];

    NSDictionary * itemDictionary = [[NSDictionary dictionaryWithXMLData:responseObject] objectForKey:kAmazonRootNode];
    for(int i = 0; i < [[itemDictionary objectForKey:kAmazonFeedItemKey] count]; i++){
        RSSBookEntryModel *cBEO = [[RSSBookEntryModel alloc] init];

        NSDictionary *currentItem = [[itemDictionary objectForKey:kAmazonFeedItemKey] objectAtIndex:i];
        NSString *finalXMLString = [NSString stringWithFormat:@"%@%@%@", kAmazonStartTag, [currentItem objectForKey:kAmazonDescriptionKey], kAmazonEndTag];
        NSDictionary *cData = [NSDictionary dictionaryWithXMLString:finalXMLString];

        NSArray *bookDetailsDictionary = [cData objectForKey:kAmazonSpanKey];

        NSString *bIOURL = [[[[cData objectForKey:@"div"] objectForKey:kAmazonAHREFKey] objectForKey:@"img"] objectForKey:@"_src"];
        NSString *bookImageCoverID = [[[[bIOURL componentsSeparatedByString:kAmazonBookCoverBaseURL] lastObject] componentsSeparatedByString:@"."] firstObject];




        cBEO.bookTitle = [self getBookTitleWithArray:bookDetailsDictionary];
        cBEO.bookAuthor = [self getBookAuthorWithArray:bookDetailsDictionary];
        cBEO.bookCoverImageThumbnailURL = [NSString stringWithFormat:@"%@%@%@%@", kAmazonBookCoverBaseURL, bookImageCoverID, kAmazonBookCoverThumbnailSize, kAmazonBookCoverFileExtention];
        cBEO.bookCoverImageOriginalURL = [NSString stringWithFormat:@"%@%@%@%@", kAmazonBookCoverBaseURL, bookImageCoverID, kAmazonBookCoverMaxSize, kAmazonBookCoverFileExtention];
        cBEO.bookPrice = [self getPriceFromDictionary:cData];
        cBEO.bookRating = [self getRatingFromDictionary:cData];

        [bookEntries addObject:cBEO];


    }
    return bookEntries;
}
@end

Sorry. Here it is: this is the object model you want to use. It's pretty straightforward.

@interface RSSBookEntryModel : NSObject

@property (strong, nonatomic) NSString *bookTitle;
@property (strong, nonatomic) NSString *bookAuthor;
@property (strong, nonatomic) NSString *bookCoverImageThumbnailURL;
@property (strong, nonatomic) NSString *bookCoverImageOriginalURL;
@property (strong, nonatomic) NSData *bookCoverThumbnailImage;
@property (strong, nonatomic) NSData *bookCoverOriginalImage;

@property (strong, nonatomic) NSString *bookPrice;
@property (strong, nonatomic) NSString *bookRating;


-(NSString*)description;

@end

And here are the constants I'm using to keep everything clean.

Constants.h

extern NSString * const kAmazonRootNode;
extern NSString * const kAmazonStartTag;
extern NSString * const kAmazonEndTag;

extern NSString * const kAmazonFeedItemKey;
extern NSString *const kAmazonSpanKey;
extern NSString * const kAmazonDescriptionKey;
extern NSString *const kAmazonUnderscoreTextKey;
extern NSString *const kAmazonAHREFKey;
extern NSString *const kAmazonBookCoverBaseURL;

extern NSString *const kAmazonBookCoverThumbnailSize;
extern NSString *const kAmazonBookCoverMaxSize;
extern NSString *const kAmazonBookCoverFileExtention;

And here's the Constants.m file.

NSString * const kAmazonRootNode = @"channel";
NSString * const kAmazonStartTag = @"<startTag>";
NSString * const kAmazonEndTag = @"</startTag>";

NSString * const kAmazonFeedItemKey = @"item";
NSString *const kAmazonSpanKey = @"span";
NSString * const kAmazonDescriptionKey = @"description";
NSString *const kAmazonUnderscoreTextKey = @"__text";
NSString *const kAmazonAHREFKey = @"a";
NSString *const kAmazonBookCoverBaseURL = @"http://ecx.images-amazon.com/images/";

NSString *const kAmazonBookCoverThumbnailSize = @"._SL100";
NSString *const kAmazonBookCoverMaxSize = @"._SL500";
NSString *const kAmazonBookCoverFileExtention = @".jpg";
Jay Maragh
  • wow, ok, just looking at the code, I noticed that in the author code you're covering the case where the values returned are sometimes `null`! great. but what does the RSSBookEntryModel stand for and what's contained within? Also where are the constants? – Pavan Nov 17 '14 at 20:40
  • I'm glad you noticed the extra handling ;) – Jay Maragh Nov 17 '14 at 20:44
  • You don't need to feed anything into the arrays. Those methods are used by the main method, and it's to that main method that you feed the Amazon response object. Everything else is taken care of; you will then receive an array of your ten book objects from that Amazon feed. – Jay Maragh Nov 17 '14 at 20:47
  • worked like a charm, but this is still very hardcoded. I gave it to you because you dealt with the cases where sometimes the author value would return null. Thanks – Pavan Nov 17 '14 at 20:49
1

This is a pretty weak alternative here, but maybe it helps somehow:

//title
console.log("TITLE: " + $(".riRssTitle").text().trim());

//image
console.log("IMAGE: " + $(document).find("img").attr("src"));

//author
console.log("AUTHOR: " + $(".riRssContributor").find("a").text().trim());

//new price and striked price
var buy_new_link = $("a:contains('Buy new')");
var new_price_striked_element = buy_new_link.siblings("strike");
//a jQuery object is always truthy, so test its length instead
if(new_price_striked_element.length){
   console.log("NEW PRICE STRIKED: " + new_price_striked_element.text().trim());
}else{
   //the <b> price is nested inside a <font> sibling, it is not a sibling itself
   console.log("NEW PRICE: " + buy_new_link.siblings("font").find("b").text().trim());
}

//used price
console.log("USED PRICE FROM: " + $(".price").text().trim());

//stars
var url = $("img[src*='stars']").attr("src");
var myRegexp = /stars-([0-9]-[0-9])/;
var match = myRegexp.exec(url);
//exec returns null when the item has no stars image
console.log("STARS: " + (match ? match[1] : "n/a"));

Example: http://jsfiddle.net/qpuaxtv3/
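
Since the markup differs from item to item, it may help to keep the string handling separate from the DOM lookups, so that a missing value comes back as null rather than throwing. This is only a rough sketch: it assumes the stars image URL keeps its stars-N-M pattern and that the cover image ID is everything before the first dot of the file name. parseStarRating and cleanCoverUrl are made-up names, not part of jQuery or any Amazon API.

```javascript
// Hypothetical helpers (illustrative names); they only do string handling.

// ".../stars-4-0._V192253865_.gif" gives 4, ".../stars-3-5..." gives 3.5,
// and anything without the stars-N-M pattern gives null.
function parseStarRating(src) {
  var match = /stars-(\d)-(\d)/.exec(src || "");
  return match ? parseFloat(match[1] + "." + match[2]) : null;
}

// Keep the image ID and the file extension, and drop the size/sticker
// modifiers that sit between them in the file name.
function cleanCoverUrl(src) {
  if (!src) return null;
  var slash = src.lastIndexOf("/");
  var parts = src.slice(slash + 1).split(".");
  if (parts.length < 2) return src; // no extension, nothing to strip
  return src.slice(0, slash + 1) + parts[0] + "." + parts[parts.length - 1];
}
```

With jQuery, these could be fed from the same selectors as above, for example parseStarRating($("img[src*='stars']").attr("src")); attr() returns undefined when nothing matches, which the helper turns into null.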

carlosHT
  • It's really annoying that the item nodes in the Amazon feed are not consistent. Sometimes the price can be retrieved and other times it cannot. There are no specific nodes for the different entities. I ended up creating many different scenarios and doing a lot of "hunting"; it's not how I expected to solve the problem – Pavan Nov 11 '14 at 23:26
  • Yeah, this is pretty weak. It works for one specific scenario but might not for the next. – carlosHT Nov 11 '14 at 23:28