Possible Duplicates:
Does IMDB provide an API?
How to send a header using a HTTP request through a curl call?

I am using PHP curl to scrape movie details from IMDB. Fetching the data works, but I am facing a problem with non-English movies like this movie.

When I open this movie in my browser, I get the English version of the IMDB page, which shows the movie name as "Boarding School". But when I fetch the page through curl, I get the original-language page, where the movie name is "Leidenschaftliche Blümchen".

How can I fetch the English version of the IMDB page with curl?

pravat231
  • Have you tried passing a valid user agent with region information? The option is `-A` in curl. – James Wilcox Aug 10 '11 at 10:14
  • From the [IMDB ToS](http://www.imdb.com/help/show_article?conditions): *Robots and Screen Scraping: You may not use data mining, robots, screen scraping, or similar data gathering and extraction tools on this site, except with our express written consent as noted below.* – Gordon Aug 10 '11 at 10:20
  • Or, you could actually just skip trying to parse their unstructured data and download their structured data. http://www.imdb.com/interfaces – John Green Aug 10 '11 at 10:21
  • Because I have already seen elsewhere that others get the result I want. – pravat231 Aug 10 '11 at 10:22

1 Answer

When you request a page with a browser, the browser sends specific request headers to the server. A Firefox extension like Firebug can show these (check the Net panel). As an example, these are the headers I just sent to the server with Firefox:

GET /title/tt0076306/ HTTP/1.1
Host: www.imdb.com
User-Agent: Mozilla/5.0 (Windows NT 5.1; rv:5.0) Gecko/20100101 Firefox/5.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-us,en;q=0.8,de-de;q=0.5,de;q=0.3
Accept-Encoding: gzip, deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Connection: keep-alive
...

The header that most likely makes the difference:

Accept-Language: en-us,en;q=0.8,de-de;q=0.5,de;q=0.3

See section 14.4 (Accept-Language) of RFC 2616.

When you use curl, it sends request headers as well, but they may differ from the browser's. However, you can tell curl to use the headers you specify, too.

You just need to make curl use the headers your browser uses and you should get the same result. See How to send a header using a HTTP request through a curl call?.
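
To compare curl's headers with the browser's, it can help to print what curl actually sends. The following is a minimal sketch, not a definitive tool: `show_request_headers` is a hypothetical helper name, and it relies on curl's verbose output marking outgoing request lines with a leading `>`.

```shell
#!/bin/sh
# Hypothetical helper: print only the request lines and headers that curl sends.
# All arguments are passed straight through to curl (URL, extra -H options, ...).
show_request_headers() {
    # --verbose writes the request/response trace to stderr, so merge it into
    # stdout and keep only the outgoing lines (those starting with ">").
    curl --silent --verbose --output /dev/null "$@" 2>&1 | grep '^>'
}

# Usage (needs network access):
#   show_request_headers http://www.imdb.com/title/tt0076306/
#   show_request_headers -H "Accept-Language: en-us,en;q=0.8" http://www.imdb.com/title/tt0076306/
```

If no `Accept-Language` line shows up in the first call, curl is not sending one by default, and the server is free to pick a language on its own.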

To get the German version of the page, for example:

curl -H "Accept-Language: de-de;q=0.8,de;q=0.5" http://www.imdb.com/title/tt0076306/

For the English version:

curl -H "Accept-Language: en-us,en;q=0.8,de-de;q=0.5,de;q=0.3" http://www.imdb.com/title/tt0076306/
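
To check quickly which language version came back, one option is to print only the page title for a given `Accept-Language` value. This is a rough sketch under stated assumptions: `page_title` is a hypothetical helper name, and it assumes the `<title>` element sits on a single line of the returned HTML, which may not hold for every page.

```shell
#!/bin/sh
# Hypothetical helper: print the <title> of the page returned for a given
# Accept-Language value.
#   $1 = Accept-Language header value
#   $2 = URL to fetch
page_title() {
    # --location follows redirects; the header asks the server for a language.
    curl --silent --location --header "Accept-Language: $1" "$2" |
        # Assumption: <title>...</title> is on one line of the response.
        sed -n 's:.*<title>\(.*\)</title>.*:\1:p' |
        head -n 1
}

# Usage (needs network access):
#   page_title "de-de;q=0.8,de;q=0.5" http://www.imdb.com/title/tt0076306/
#   page_title "en-us,en;q=0.8,de-de;q=0.5,de;q=0.3" http://www.imdb.com/title/tt0076306/
```

If the two calls print different titles, the server is honoring the header.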
hakre
  • Could you please tell me what the proper header for this would be? I already tried this but am not getting the result. – pravat231 Aug 10 '11 at 10:19
  • @pravat231: I extended the answer, made a suggestion and linked the specification of the header in question. – hakre Aug 10 '11 at 10:23
  • Yeah, I was trying the same thing. The other suspect was JavaScript. Did you "actually" try sending the same headers and check what the response is? – Gaurav Gupta Aug 10 '11 at 10:24
  • Yes, I already tried, but I saw that another website gets the proper result. – pravat231 Aug 10 '11 at 10:27
  • @Gaurav Gupta: Added a curl calling example that does this for me, both German and English. – hakre Aug 10 '11 at 10:28
  • Are you getting the English version of the title for the link I mentioned? – pravat231 Aug 10 '11 at 10:29
  • @pravat231: I added an English variant as well. It just works for me. Read the specs of the header, it tells you how it works and how you can specify the language you would like to have for the request. It's all in the docs. – hakre Aug 10 '11 at 10:33
  • Thank you, give me some time and I will check it. – pravat231 Aug 10 '11 at 10:34