0

I am writing a crawler in Java that examines an IMDb movie page and extracts some info such as name, year, etc. The user types (or copy/pastes) the link to the title and my program should do the rest.

After examining the HTML source of several IMDb pages and reading up on how crawlers work, I managed to write some code.

The info I get (for example title) is in my mother tongue. If there is no info in my mother tongue I get the original title. What I want is to get the title in a specific language of my choosing.

I'm fairly new to this, so correct me if I'm wrong, but I think I get the results in my mother tongue because IMDb "sees" that I'm from Serbia and then customizes the results for me. So basically I need to tell it somehow that I prefer results in English? Is that possible (I imagine it is), and how do I do it?

Edit: The program crawls like this: it takes the URL path as a String, converts it to a URL, reads the whole page source with a BufferedReader and inspects what it gets. I'm not sure if that is the right way to do it, but it's working (minus the language problem). Code:

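import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;

public static Info crawlUrl(String urlPath) throws IOException {
    Info info = new Info();

    // Open the page and scan its source line by line.
    URL url = new URL(urlPath);
    URLConnection uc = url.openConnection();
    // try-with-resources closes the reader even if reading fails.
    try (BufferedReader in = new BufferedReader(new InputStreamReader(
            uc.getInputStream(), "UTF-8"))) {
        String inputLine;
        while ((inputLine = in.readLine()) != null) {
            // Print any line that contains the page's <title> tag.
            if (inputLine.contains("<title>")) System.out.println(inputLine);
        }
    }
    return info;
}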

This code goes through a page and prints the main title to the console.

Invader Zim

2 Answers

4

You don't need to crawl IMDb; you can use the dumps they provide: http://www.imdb.com/interfaces

There's also a parser for the data they provide: https://code.google.com/p/imdbdumpimport/. It's not perfect, but it may help (expect to spend some effort getting it to work).

An alternative parser: https://github.com/dedeler/imdb-data-parser

EDIT: You say you want to crawl IMDb anyway for learning purposes. In that case you'll probably have to go with content negotiation (http://en.wikipedia.org/wiki/Content_negotiation), as suggested in the other answer:

uc.setRequestProperty("Accept-Language", "de; q=1.0, en; q=0.5");
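
For context, here is a minimal sketch of how that header might fit into the reading loop from the question. The class name `LanguageAwareCrawler` and both helper names are made up for illustration; only `setRequestProperty` and the standard `java.net`/`java.io` classes are real API, and whether the site honours the header is up to the server.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;

class LanguageAwareCrawler {

    // Hypothetical helper: prepares (but does not yet open) a connection
    // that asks the server for the given languages.
    static URLConnection openWithLanguage(String urlPath, String acceptLanguage)
            throws IOException {
        URLConnection uc = new URL(urlPath).openConnection();
        // Headers must be set before the connection is actually opened.
        uc.setRequestProperty("Accept-Language", acceptLanguage);
        return uc;
    }

    // Hypothetical helper: same loop as in the question, printing every
    // source line that contains a <title> tag.
    static void printTitleLines(URLConnection uc) throws IOException {
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(uc.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                if (line.contains("<title>")) {
                    System.out.println(line);
                }
            }
        }
    }
}
```

Something like `printTitleLines(openWithLanguage(urlPath, "en-US, en; q=0.8"))` would then ask for English first, with `urlPath` being whatever title link the user pasted.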
Jakub Kotowski
  • I know that there are IMDb dumps, but I'm not writing this only for its functionality but also for learning purposes. – Invader Zim Jan 03 '14 at 22:53
  • In that case you'll have to go with http://en.wikipedia.org/wiki/Content_negotiation like cubitouch proposes in the other answer. – Jakub Kotowski Jan 03 '14 at 22:57
2

Take a look at the request headers your crawler sends. Mine contain Accept-Language:fr-FR,fr;q=0.8,en-US;q=0.6,en;q=0.4, so I get the title in French.

EDIT:

I checked with the ModifyHeaders add-on in Google Chrome, and the value en-US gets me the English title for the movie =)
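
To look at the headers from the Java side rather than the browser, one option is `URLConnection.getRequestProperties()`. The sketch below is only an illustration: the class name `RequestHeaderDump`, the method `explicitHeaders` and the placeholder title ID are made up. As far as I know the returned map only lists headers you set yourself, not the defaults the JDK adds when it connects, so a tool like Wireshark is still needed to see the complete request.

```java
import java.io.IOException;
import java.net.URL;
import java.net.URLConnection;
import java.util.List;
import java.util.Map;

class RequestHeaderDump {

    // Hypothetical helper: sets Accept-Language and returns the request
    // headers that were explicitly set on the connection.
    static Map<String, List<String>> explicitHeaders(String urlPath, String acceptLanguage)
            throws IOException {
        URLConnection uc = new URL(urlPath).openConnection();
        uc.setRequestProperty("Accept-Language", acceptLanguage);
        // Must be called before connecting; once connected it throws
        // IllegalStateException.
        return uc.getRequestProperties();
    }

    public static void main(String[] args) throws IOException {
        // Placeholder URL; no network traffic happens here because the
        // connection is never actually opened.
        explicitHeaders("http://www.imdb.com/title/tt0000000/", "en-US")
                .forEach((name, values) -> System.out.println(name + ": " + values));
    }
}
```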

cubitouch
  • I'll try that. So I just install that add-on and set the value and that's it? – Invader Zim Jan 03 '14 at 22:35
  • With your web browser, yes. Then you just need to add that particular HTTP header, with the right value, to your request when crawling the target. It seems that you should look at http://docs.oracle.com/javase/6/docs/api/java/net/URLConnection.html#setRequestProperty%28java.lang.String,%20java.lang.String%29 – cubitouch Jan 03 '14 at 22:37
  • I can't get it to work :(( I installed the modHeader add-on in Chrome and entered the value, and I still get the results in Serbian. How do I check the request headers used by my program? Any useful links? I'm a noob when it comes to these things – Invader Zim Jan 03 '14 at 22:50
  • As I was saying, modHeader only affects the Google Chrome web browser. To look at every network request you can use software like Wireshark. Did you try using the setRequestProperty method? What is your source code for now? – cubitouch Jan 03 '14 at 22:53
  • Try looking at this thread to add the HTTP header: http://stackoverflow.com/questions/2793150/how-to-use-java-net-urlconnection-to-fire-and-handle-http-requests – cubitouch Jan 03 '14 at 22:58
  • setRequestProperty("Accept-Language", "en-GB") works. Thank you very much. :DD – Invader Zim Jan 03 '14 at 23:05