How to avoid server error 401 (and 403) while using boilerpipe?

Question

I use BoilerPipe for Java to extract article text from web pages. It works on many sites, but on several sites I get an HTTP 401 error, even though my web browser needs no authentication to display the same page.

Here's an example of a page that returns a 401 error: http://www.nature.com/nchem/journal/v7/n4/full/nchem.2206.html

I call the ArticleExtractor like this:

URL url = new URL("http://www.nature.com/nchem/journal/v7/n4/full/nchem.2206.html");
String article = ArticleExtractor.INSTANCE.getText(url);

And here's the error:

de.l3s.boilerpipe.BoilerpipeProcessingException: java.io.IOException: Server returned HTTP response code: 401 for URL: http://www.nature.com/nchem/journal/v7/n4/full/nchem.2206.html

By exploring the stack trace, I found that the exception is thrown right after the connection is opened (in a BoilerPipe class):

final URLConnection conn = url.openConnection();
final String ct = conn.getContentType(); // The exception is thrown here

I also get 403 errors on other websites, although I can view the articles in my web browser. How can I avoid this problem?

Thank you!

EDIT - UPDATE: I managed to solve the 403 error by adding the following line after opening the connection:

conn.setRequestProperty("User-Agent", "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.4; en-US; rv:1.9.2.2) Gecko/20100316 Firefox/3.6.2");
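Since getText(URL) opens its own connection inside BoilerPipe, one way to apply this header without patching the library is to download the page yourself and hand the HTML to the extractor. The following is only a sketch: the class name UserAgentFetch and its helper methods are made up for illustration, and it assumes the page is UTF-8 encoded.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of BoilerPipe.
class UserAgentFetch {
    // The browser-like User-Agent string from the update above.
    static final String USER_AGENT =
        "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.4; en-US; rv:1.9.2.2) Gecko/20100316 Firefox/3.6.2";

    // Opens a connection with the User-Agent header set; nothing is sent yet.
    static HttpURLConnection open(URL url) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestProperty("User-Agent", USER_AGENT);
        return conn;
    }

    // Downloads the page body as a string (assumption: UTF-8 content).
    static String fetchHtml(URL url) throws IOException {
        HttpURLConnection conn = open(url);
        try (InputStream in = conn.getInputStream()) {
            return new String(readAll(in), StandardCharsets.UTF_8);
        } finally {
            conn.disconnect();
        }
    }

    // Reads a stream to the end into a byte array.
    static byte[] readAll(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        int n;
        while ((n = in.read(buf)) != -1) {
            out.write(buf, 0, n);
        }
        return out.toByteArray();
    }
}
```

The resulting string can then be passed to the extractor's String overload, for example ArticleExtractor.INSTANCE.getText(UserAgentFetch.fetchHtml(url)).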

But I still haven't solved the 401 problem. I looked at the response headers in my web browser and found that even the browser receives a 401 status, yet it can still display the content. I took a screenshot: http://img11.hostingpics.net/pics/757747error401.png

Now I don't even know whether it is possible to get the text using only the URL that works in my web browser... If someone can help me, that would be great! :)
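On the observation that the browser receives a 401 and still shows the page: a server can send a full body along with an error status. HttpURLConnection throws an IOException from getInputStream() for such statuses, but the body, if any, is available from getErrorStream(). Whether the 401 response from this particular site actually contains the article is not known; the sketch below only shows how to read it. The class name ErrorBodyReader is hypothetical, and UTF-8 is assumed.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of BoilerPipe.
class ErrorBodyReader {
    // Returns the response body even when the status is 4xx or 5xx.
    static String readBody(HttpURLConnection conn) throws IOException {
        int status = conn.getResponseCode();
        // For error statuses the body is exposed through getErrorStream().
        InputStream in = status >= 400 ? conn.getErrorStream() : conn.getInputStream();
        if (in == null) {
            return ""; // the server sent no body
        }
        try (InputStream body = in) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[8192];
            int n;
            while ((n = body.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
            // Assumption: the page is UTF-8 encoded.
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        }
    }
}
```

If the returned string does contain the article markup, it could be passed to ArticleExtractor.INSTANCE.getText(String) in place of the URL.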

EDIT - UPDATE 2: I inspected the network traffic and found several requests that returned 200 (essentially the first link with small changes and a lot of GET parameters), but requesting those URLs from my code still returned a 401 error, so I don't know which one to use. There were also some 302/303 redirects, which got me no further.

EDIT - UPDATE 3: Maybe rephrasing the question makes things clearer: is there a way for my URLConnection to follow the same "path" of requests that a web browser would?
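Two things a browser does along that path can be approximated with the standard library: keeping cookies between requests and following redirects. HttpURLConnection follows redirects on its own, but not from one protocol to another (http to https), so the sketch below follows them by hand. The class name BrowserLikeFetcher and the redirect limit are assumptions made for illustration, and nothing here executes JavaScript, so it may still not be enough for this site.

```java
import java.io.IOException;
import java.net.CookieHandler;
import java.net.CookieManager;
import java.net.CookiePolicy;
import java.net.HttpURLConnection;
import java.net.MalformedURLException;
import java.net.URL;

// Hypothetical helper class, not part of BoilerPipe.
class BrowserLikeFetcher {
    static final int MAX_REDIRECTS = 10; // arbitrary safety limit

    // Follows redirects manually, keeping cookies, and returns the final connection.
    static HttpURLConnection openFollowingRedirects(URL start, String userAgent) throws IOException {
        // A JVM-wide cookie store, so cookies set by one response
        // are sent with the following requests.
        if (CookieHandler.getDefault() == null) {
            CookieHandler.setDefault(new CookieManager(null, CookiePolicy.ACCEPT_ALL));
        }
        URL current = start;
        for (int hop = 0; hop <= MAX_REDIRECTS; hop++) {
            HttpURLConnection conn = (HttpURLConnection) current.openConnection();
            conn.setInstanceFollowRedirects(false); // redirects are handled below
            conn.setRequestProperty("User-Agent", userAgent);
            int status = conn.getResponseCode();
            if (!isRedirect(status)) {
                return conn; // caller reads the body and disconnects
            }
            String location = conn.getHeaderField("Location");
            conn.disconnect();
            if (location == null) {
                throw new IOException("Redirect without a Location header");
            }
            current = resolve(current, location);
        }
        throw new IOException("Too many redirects");
    }

    static boolean isRedirect(int status) {
        return status == 301 || status == 302 || status == 303
            || status == 307 || status == 308;
    }

    // Resolves a possibly relative Location value against the current URL.
    static URL resolve(URL base, String location) throws MalformedURLException {
        return new URL(base, location);
    }
}
```

The connection returned by openFollowingRedirects can be read like any other, and the downloaded HTML passed to ArticleExtractor.INSTANCE.getText(String).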
