
For those veterans who haven't tried Hpple, it's great. It uses XPath to search HTML/XML documents. It gets the job done, and it's easy enough for a newbie like me to understand. However, I'm having trouble with one query.

I have this chunk of HTML:

    <ul class="challengesList dailyChallengesList">

<li>
<div class="corner topLeft"></div>
<img id="ctl00_mainContent_dailyChallengesRepeater_ctl00_challengeImage" title="Gunslinger" src="/images/reachstats/challenges/0.png" alt="Gunslinger" style="border-width:0px;">
<div class="info">
<div class="rFloat">
<p id="ctl00_mainContent_dailyChallengesRepeater_ctl00_challengeExpiration" class="timeDisplay dailyExpirationCountdown"><span>0d</span><span>19h</span><span>9m</span><span class="seconds">37s</span></p>
<p>1500cR</p>
</div>
<h5>Gunslinger</h5>
<p class="description">Kill 150 enemies in multiplayer Matchmaking.</p>
<div class="reward">

<div id="ctl00_mainContent_dailyChallengesRepeater_ctl00_progressBox" class="barContainer">
<div id="ctl00_mainContent_dailyChallengesRepeater_ctl00_progressBar" class="bar" style="width:21%;"><span></span></div> 
<p>31/150</p>
</div>
</div>
</div>
<div class="clear"></div>
</li>

<li>
<div class="corner topLeft"></div>
<img id="ctl00_mainContent_dailyChallengesRepeater_ctl01_challengeImage" title="A Great Friend" src="/images/reachstats/challenges/0.png" alt="A Great Friend" style="border-width:0px;">
<div class="info">
<div class="rFloat">
<p id="ctl00_mainContent_dailyChallengesRepeater_ctl01_challengeExpiration" class="timeDisplay dailyExpirationCountdown"><span>0d</span><span>19h</span><span>9m</span><span class="seconds">37s</span></p>
<p>1400cR</p>
</div>
<h5>A Great Friend</h5>
<p class="description">Earn 15 assists today in multiplayer Matchmaking.</p>
<div class="reward">

<div id="ctl00_mainContent_dailyChallengesRepeater_ctl01_progressBox" class="barContainer">
<div id="ctl00_mainContent_dailyChallengesRepeater_ctl01_progressBar" class="bar" style="width:40%;"><span></span></div> 
<p>6/15</p>
</div>
</div>
</div>
<div class="clear"></div>
</li>

<li>
<div class="corner topLeft"></div>
<img id="ctl00_mainContent_dailyChallengesRepeater_ctl02_challengeImage" title="Cannon Fodder" src="/images/reachstats/challenges/2.png" alt="Cannon Fodder" style="border-width:0px;">
<div class="info">
<div class="rFloat">
<p id="ctl00_mainContent_dailyChallengesRepeater_ctl02_challengeExpiration" class="timeDisplay dailyExpirationCountdown"><span>0d</span><span>19h</span><span>9m</span><span class="seconds">37s</span></p>
<p>1000cR</p>
</div>
<h5>Cannon Fodder</h5>
<p class="description">Kill 50 infantry-class foes in the Campaign today.</p>
<div class="reward">

<div id="ctl00_mainContent_dailyChallengesRepeater_ctl02_progressBox" class="barContainer">
<div id="ctl00_mainContent_dailyChallengesRepeater_ctl02_progressBar" class="bar" style="width:0%;"><span></span></div> 
<p>0/50</p>
</div>
</div>
</div>
<div class="clear"></div>
</li>

<li>
<div class="corner topLeft"></div>
<img id="ctl00_mainContent_dailyChallengesRepeater_ctl03_challengeImage" title="Heroic Demon" src="/images/reachstats/challenges/3.png" alt="Heroic Demon" style="border-width:0px;">
<div class="info">
<div class="rFloat">
<p id="ctl00_mainContent_dailyChallengesRepeater_ctl03_challengeExpiration" class="timeDisplay dailyExpirationCountdown"><span>0d</span><span>19h</span><span>9m</span><span class="seconds">37s</span></p>
<p>1500cR</p>
</div>
<h5>Heroic Demon</h5>
<p class="description">Kill 30 Elites in Firefight Matchmaking on Heroic or harder.</p>
<div class="reward">

<div id="ctl00_mainContent_dailyChallengesRepeater_ctl03_progressBox" class="barContainer">
<div id="ctl00_mainContent_dailyChallengesRepeater_ctl03_progressBar" class="bar" style="width:0%;"><span></span></div> 
<p>0/30</p>
</div>
</div>
</div>
<div class="clear"></div>
</li>

</ul>

The nutty part is, I cannot get Hpple to "see" the `<div class="reward">`. I'm using the following to find it:

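// Adjacent string literals are concatenated, so the XPath stays readable
// without putting raw newlines inside a single literal.
NSArray *rawProgress = [doc search:@"//ul[@class='challengesList']"
                                   @"/li/div[@class='info']"
                                   @"/div[@class='reward']/p"];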

This always returns an empty array. It's driving me nuts, as the same kind of thing worked for all of the other elements in this project...

Any help would be appreciated :)

EDIT

This works:

NSArray *rawDescriptions = [doc search:@"//ul[@class='challengesList']"
                                       @"/li/div[@class='info']"
                                       @"/p[@class='description']"];

This doesn't:

NSArray *rawProgress = [doc search:@"//ul[@class='challengesList']"
                                   @"/li/div[@class='info']"
                                   @"/div[@class='reward']"
                                   @"/div[@id]//p"];

Furthermore, trying to list the child nodes of rFloat or reward produces a crash :(

Aurum Aquila
  • Don't forget to put backquotes around the `<div class="reward">` element in the text of your question... fixed it for you.
    – LarsH Dec 07 '10 at 13:14
  • It got unfixed by your edit. I'll leave it to you to put the backquotes in where needed, after 'cannot get Hpple to "see" the'. – LarsH Dec 07 '10 at 13:22
  • Can you post more of your input HTML? And triple-check that what you posted is really what's coming in as input? – LarsH Dec 07 '10 at 15:37
  • I put in 4/5ths of the input HTML. You can view the full source at: view-source:http://www.bungie.net/Stats/Reach/Challenges.aspx?player=Aurum+Aquila – Aurum Aquila Dec 07 '10 at 15:54
  • Also note, the original page is here: http://www.bungie.net/Stats/Reach/Challenges.aspx?player=Aurum+Aquila – Aurum Aquila Dec 07 '10 at 15:55

2 Answers


Your `p` element is not an immediate child of `<div class="reward">`; it sits one level deeper, inside the `barContainer` div.

Using the XML you provided, the XPath expression

div[@class='info']/div[@class='reward']//p

will work.
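
The difference between the child step (`/p`) and the descendant step (`//p`) can be checked outside Hpple. The following is only a minimal sketch using Python's standard-library `xml.etree.ElementTree`, not Hpple or libxml2, and it assumes a trimmed, well-formed copy of one `<li>` from the question (the unclosed `<img>` is left out so the fragment parses as XML):

```python
import xml.etree.ElementTree as ET

# Assumption: a trimmed, well-formed copy of one <li> from the question.
FRAGMENT = """
<ul class="challengesList dailyChallengesList">
  <li>
    <div class="info">
      <h5>Gunslinger</h5>
      <p class="description">Kill 150 enemies in multiplayer Matchmaking.</p>
      <div class="reward">
        <div class="barContainer">
          <div class="bar"><span></span></div>
          <p>31/150</p>
        </div>
      </div>
    </div>
  </li>
</ul>
"""

root = ET.fromstring(FRAGMENT)  # root is the <ul> element

# Child step: only <p> elements that are direct children of div.reward.
direct = root.findall("./li/div[@class='info']/div[@class='reward']/p")

# Descendant step: <p> elements at any depth below div.reward.
nested = root.findall("./li/div[@class='info']/div[@class='reward']//p")

print(len(direct))                 # number of direct <p> children
print([p.text for p in nested])    # text of the nested <p> elements
```

On this fragment the child-step query finds nothing, while the descendant-step query finds the progress paragraph. If the same expression still returns nothing in Hpple, the cause is more likely the parsed tree than the XPath itself.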

Flack
  • Thanks for the recommendation, but this returns a null value. I've added an example of an expression that does work. – Aurum Aquila Dec 07 '10 at 13:09
  • @Aurum - @Flack is right that your first XPath, as given, *should* not return anything because `div[@class='reward']` has no immediate p child element. – LarsH Dec 07 '10 at 13:24
  • But the problem is, when I ask it to list reward's children, there appears to be nothing in it. When I ask it to list info, reward does not appear. – Aurum Aquila Dec 07 '10 at 14:26

See this SO question for a similar report on problems with Hpple and a list of alternatives.

You may be seeing a bug. According to this page,

"It's classified as an experimental project by the developer, but so far it's 'worked for me'"

"UPDATE: seems to be kinda broken now. Anyone got a better solution?"

You may want to enter a bug report, and if the project is still being maintained, maybe the developer will respond with a fix or solution. Or you could leave a comment on this page that recommended hpple, and see if that blogger or one of his readers can address the problem or tell you if hpple is active at all.

You could also see if you can find HyperParser. "It's a simple HTML parser that has API similar to NSXMLParser. Designed specially to parse semi-valid HTML." But it doesn't seem to be there at the link where it used to be.

Community
LarsH
  • Yeah, it's an issue with libxml. I tried using it straight up, with the same result. I think the website has malformed HTML... So, I'm thinking about using HTML tidy or scraping the stats from someone else with better HTML. – Aurum Aquila Dec 08 '10 at 00:28
  • Could this have anything to do with the fact that the img tags aren't closed? – Aurum Aquila Dec 08 '10 at 02:35
  • @Aurum: Seems unlikely, since one of your XPath expressions with `li/div[@class='info']` is working, when there is an `<img>` before that div. But based on what you found out about libxml, why lean on a broken reed? HTMLTidy sounds like a good solution. – LarsH Dec 08 '10 at 13:13
  • According to the site, HyperParser is now part of the BaseAppKit, located here: http://baseappkit.com/ – s73v3r Jan 04 '12 at 06:25
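
As a way to test whether the markup itself (including the unclosed `<img>` tags) is what hides the progress values, the same data can be pulled out with a parser that does not build a strict tree. This is only a rough sketch using Python's standard-library `html.parser`; it is not HTMLTidy, Hpple, or libxml2, and `ProgressCollector` and `SAMPLE_HTML` are hypothetical names, with the sample trimmed from the HTML in the question:

```python
from html.parser import HTMLParser

# Assumption: two <li> items trimmed from the question, <img> left unclosed.
SAMPLE_HTML = """
<ul class="challengesList dailyChallengesList">
<li>
<img title="Gunslinger" src="/images/reachstats/challenges/0.png" alt="Gunslinger">
<div class="info">
<div class="rFloat"><p>1500cR</p></div>
<p class="description">Kill 150 enemies in multiplayer Matchmaking.</p>
<div class="reward">
<div class="barContainer">
<div class="bar" style="width:21%;"><span></span></div>
<p>31/150</p>
</div>
</div>
</div>
</li>
<li>
<img title="A Great Friend" src="/images/reachstats/challenges/0.png" alt="A Great Friend">
<div class="info">
<div class="rFloat"><p>1400cR</p></div>
<p class="description">Earn 15 assists today in multiplayer Matchmaking.</p>
<div class="reward">
<div class="barContainer">
<div class="bar" style="width:40%;"><span></span></div>
<p>6/15</p>
</div>
</div>
</div>
</li>
</ul>
"""


class ProgressCollector(HTMLParser):
    """Collects the text of <p> elements nested inside <div class="reward">."""

    def __init__(self):
        super().__init__()
        self.div_stack = []   # one flag per open <div>: is it the reward div?
        self.in_p = False     # True while inside a <p> below a reward div
        self.progress = []

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            classes = (dict(attrs).get("class") or "").split()
            self.div_stack.append("reward" in classes)
        elif tag == "p" and any(self.div_stack):
            self.in_p = True

    def handle_endtag(self, tag):
        if tag == "div" and self.div_stack:
            self.div_stack.pop()
        elif tag == "p":
            self.in_p = False

    def handle_data(self, data):
        if self.in_p and data.strip():
            self.progress.append(data.strip())


collector = ProgressCollector()
collector.feed(SAMPLE_HTML)
print(collector.progress)
```

Because this approach reacts to tags as they stream past instead of querying a repaired tree, an unclosed `<img>` cannot swallow the divs that follow it. The trade-off is that the nesting has to be tracked by hand, as the `div_stack` list does here.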