Crawl Website using PHP

Question

I've tried a bunch of techniques to crawl this URL (see below), and for some reason the title comes back incorrect. If I inspect the page with Firebug I can see the correct title tag; however, if I view the page source, the title is different.

I get the same result with several different PHP techniques. Digg, however, is able to crawl the page and parse the correct title.

Here's the link: http://lifehacker.com/#!5772420/how-to-make-ios-more-like-android

The correct title is "How to Make Your iPhone (or Other iOS Device) More Like Android". The parsed title is "Lifehacker, tips and downloads for getting things done".

Is this normal? How are they doing this? Is there a way to get the correct title?

See: http://stackoverflow.com/questions/3009380/whats-the-shebang-hashbang-in-facebook-and-new-twitter-urls-for (comment, Mar 08 '11 at 04:28)

score 1 · Answer 1 · answered Mar 08 '11 at 04:28

That's because when you request it using PHP (without any JS support), the part of the URL after the # is never sent to the server, so you're getting the main page of Lifehacker, which is lifehacker.com.

Lifehacker switched their CMS recently so that all requests go to an initial page, and a JS script on that page then reads everything after the hashbang to figure out which page needs to be served. You need to modify your program to take this into account.
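To see why the server never learns which article was requested, it can help to split the URL the way an HTTP client would. The sketch below relies only on PHP's built-in parse_url; the helper name requestTarget is hypothetical and exists purely for illustration.

```php
<?php
// Hypothetical helper: returns the path (plus query string) that an HTTP
// client would send to the server. The fragment is left out, because
// fragments are handled on the client and never transmitted.
function requestTarget(string $url): string
{
    $parts = parse_url($url);
    if ($parts === false) {
        return '/'; // malformed URL: fall back to the root path
    }
    $target = $parts['path'] ?? '/';
    if (isset($parts['query'])) {
        $target .= '?' . $parts['query'];
    }
    return $target;
}

$url = 'http://lifehacker.com/#!5772420/how-to-make-ios-more-like-android';
echo requestTarget($url), "\n";               // what the server is asked for
echo parse_url($url, PHP_URL_FRAGMENT), "\n"; // what stays behind in the client
```

The first line printed is just the root path, which is why every PHP fetch of a hashbang URL on this site comes back with the home page title.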

EDIT: Have a gander at these links:

http://code.google.com/web/ajaxcrawling/docs/getting-started.html

http://www.tbray.org/ongoing/When/201x/2011/02/09/Hash-Blecch

score 0 · Accepted Answer · answered Mar 08 '11 at 21:08

Found the answer:

http://lifehacker.com/#!5772420/how-to-make-ios-more-like-android

becomes:

http://lifehacker.com/?_escaped_fragment_=5772420/how-to-make-ios-more-like-android
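The rewrite swaps the "#!" for "?_escaped_fragment_=", which is the AJAX crawling scheme described in the Google link from the other answer. A minimal sketch of doing that in PHP, and then pulling the title out of the returned HTML, might look like the following. Both function names are hypothetical, the escaping covers only the characters the scheme is assumed to require (%, #, & and +), and it only helps on sites that actually implement the scheme.

```php
<?php
// Hypothetical helper: rewrite a "#!" URL into its _escaped_fragment_ form.
function toEscapedFragmentUrl(string $url): string
{
    $pos = strpos($url, '#!');
    if ($pos === false) {
        return $url; // no hashbang, nothing to rewrite
    }
    $base = substr($url, 0, $pos);
    $fragment = substr($url, $pos + 2);

    // Append to an existing query string with "&", otherwise start one with "?".
    $separator = (strpos($base, '?') === false) ? '?' : '&';

    // Assumption: only these characters need escaping in the fragment value.
    // strtr() with an array never re-scans its own replacements, so "%" is
    // not double-escaped.
    $escaped = strtr($fragment, [
        '%' => '%25',
        '#' => '%23',
        '&' => '%26',
        '+' => '%2B',
    ]);

    return $base . $separator . '_escaped_fragment_=' . $escaped;
}

// Hypothetical helper: return the text of the first <title> element, or null.
function extractTitle(string $html): ?string
{
    if (!preg_match('/<title[^>]*>(.*?)<\/title>/is', $html, $matches)) {
        return null;
    }
    return trim(html_entity_decode($matches[1], ENT_QUOTES, 'UTF-8'));
}
```

Usage would be along the lines of `extractTitle(file_get_contents(toEscapedFragmentUrl($url)))`, assuming allow_url_fopen is enabled; a regex is used here only to keep the sketch short, and DOMDocument would be a sturdier choice for messy HTML.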

Answered by Ward.
