I'm writing some Java code to get the raw text of some Wikipedia articles: given a JList of words, I look each one up on Wikipedia and extract the first sentence of the corresponding article. My GUI contains a button for which I defined the following action listener:

private void loadButtonActionPerformed(java.awt.event.ActionEvent evt) {
    final DefaultListModel conceptsListFilesModel = new DefaultListModel();
    conceptsList.setModel(conceptsListFilesModel);
    final List definitionWiki = new ArrayList();

    // Fill the list with the first column of the table
    final Thread updater = new Thread() {
        @Override
        public void run() {
            for (int i = 0; i < 20 /*dataTable.getRowCount()*/; i++) {
                conceptsListFilesModel.addElement(dataTable.getValueAt(i, 0));

                try {
                    Object concept = conceptsListFilesModel.elementAt(i);
                    WikipediaParser parser = new WikipediaParser("en");
                    System.out.println(concept + "");
                    String firstParagraph = parser.fetchFirstParagraph(concept + "");
                    int point = firstParagraph.indexOf(".");
                    String firstsentence = firstParagraph.substring(0, point + 1);
                    definitionWiki.add(i, firstsentence);
                } catch (IOException ex) {
                    Logger.getLogger(Tex2TaxView.class.getName()).log(Level.SEVERE, null, ex);
                }

                try {
                    Thread.sleep(1000);
                } catch (InterruptedException e) {
                    throw new RuntimeException(e);
                }
            }
            JOptionPane.showMessageDialog(null, "Successful loading !");
        }
    };
    updater.start();
}

The WikipediaParser class:

public class WikipediaParser {

    private final String baseUrl;

    public WikipediaParser(String lang) {
        this.baseUrl = String.format("http://%s.wikipedia.org/wiki/", lang);
    }

    public String fetchFirstParagraph(String article) throws IOException {
        String url = baseUrl + article;
        Document doc = Jsoup.connect(url).get();
        Elements paragraphs = doc.select(".mw-content-ltr p");
        Element firstParagraph = paragraphs.first();
        return firstParagraph.text();
    }
}

The execution generates the following list of exceptions:

nov. 30, 2011 12:42:55 AM tex2tax.Tex2TaxView$11 run
Grave: null
java.net.SocketTimeoutException: Read timed out
    at java.net.SocketInputStream.socketRead0(Native Method)
    at java.net.SocketInputStream.read(SocketInputStream.java:150)
    at java.net.SocketInputStream.read(SocketInputStream.java:121)
    at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
    at java.io.BufferedInputStream.read1(BufferedInputStream.java:275)
    at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
    at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:641)
    at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:589)
    at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1319)
    at java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:468)
    at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:381)
    at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:364)
    at org.jsoup.helper.HttpConnection.execute(HttpConnection.java:143)
    at org.jsoup.helper.HttpConnection.get(HttpConnection.java:132)
    at tex2tax.WikipediaParser.fetchFirstParagraph(WikipediaParser.java:25)
    at tex2tax.Tex2TaxView$11.run(Tex2TaxView.java:595)

I need help solving this problem.

Lida
  • Isn't there any kind of API for Wikipedia? Color me surprised. Or is that the ominous `WikipediaParser`? Seems like there's information missing :) – Voo Nov 29 '11 at 23:59
  • I tried to use JWPL, but it did not work for me, so I prefer to access Wikipedia online. WikipediaParser is a class I wrote to parse the text using Jsoup. – Lida Nov 30 '11 at 00:09
  • Add your WikipediaParser class. – Jakob Weisblat Nov 30 '11 at 00:13
  • Is reading the articles via HTTP a requirement? If not, you could always just grab the latest dump of all Wikipedia articles from http://dumps.wikimedia.org/ and parse those... – esaj Nov 30 '11 at 20:36
  • The strange thing is that the project works sometimes, but most of the time the error messages are shown. I wonder whether it is due to the Internet speed: yesterday the connection was good and the application worked perfectly; today the connection is slow and the project shows errors. – Lida Nov 30 '11 at 20:39
  • I tried to use JWPL to work with Wikipedia articles and it was a disaster. Now I think I must access Wikipedia over the Internet. Is there a problem with doing so? – Lida Nov 30 '11 at 20:47

1 Answer

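Ensure that your URL is correct. A timeout usually means that there is some connectivity problem; here the trace shows a read timeout (`Read timed out`), so the connection was opened but the response did not arrive in time.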

If you have been making many requests to Wikipedia, you might be getting blocked.
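One way to cope with intermittent timeouts without hammering the server is to retry with an increasing pause. The following is a minimal sketch, not code from the question: `TimeoutRetry`, its method, and its parameters are hypothetical names, it assumes Java 8 or later for the lambda in the usage note, and it retries only on `SocketTimeoutException`. Separately, Jsoup's `Connection` has a `timeout(int millis)` setter, so `Jsoup.connect(url).timeout(10000).get()` should raise the read timeout to 10 seconds.

```java
import java.net.SocketTimeoutException;
import java.util.concurrent.Callable;

/** Hypothetical helper (not part of the question's code): retries a call that may time out. */
final class TimeoutRetry {

    private TimeoutRetry() {}

    /**
     * Runs the task up to maxAttempts times, retrying only on SocketTimeoutException.
     * The pause starts at initialDelayMillis and doubles after each failed attempt.
     * Any other exception, or a timeout on the last attempt, is rethrown to the caller.
     */
    static <T> T call(Callable<T> task, int maxAttempts, long initialDelayMillis) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        long delay = initialDelayMillis;
        for (int attempt = 1; ; attempt++) {
            try {
                return task.call();
            } catch (SocketTimeoutException e) {
                if (attempt >= maxAttempts) {
                    throw e; // out of attempts: surface the timeout
                }
                Thread.sleep(delay); // back off before trying again
                delay *= 2;
            }
        }
    }
}
```

Under these assumptions, the fetch in the question could be wrapped as `TimeoutRetry.call(() -> parser.fetchFirstParagraph(concept + ""), 3, 1000)`; the surrounding code would then need to handle `Exception` rather than only `IOException`.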

You should also be using the Wikipedia API instead of requesting and parsing web pages. It will be much faster than requesting and parsing the HTML.
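For the API route, the sketch below builds a request to the MediaWiki `action=query` endpoint with `prop=extracts` (provided on Wikipedia by the TextExtracts extension) and asks for the first sentence as plain text. The class name `WikipediaExtractClient`, the timeout parameter, and the User-Agent string are assumptions for illustration only. The response is returned as raw JSON, with the sentence in the `extract` field; parsing is left to whichever JSON library the project uses.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UnsupportedEncodingException;
import java.net.HttpURLConnection;
import java.net.URI;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

/** Hypothetical client (names are illustrative): fetches an article's first sentence via the API. */
final class WikipediaExtractClient {

    private WikipediaExtractClient() {}

    /** Builds the query URL; the title is URL-encoded so spaces and symbols are safe. */
    static String buildExtractUrl(String lang, String title) {
        final String encodedTitle;
        try {
            encodedTitle = URLEncoder.encode(title, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
        return "https://" + lang + ".wikipedia.org/w/api.php"
                + "?action=query&prop=extracts&exsentences=1&explaintext=1"
                + "&redirects=1&format=json&titles=" + encodedTitle;
    }

    /** Performs the request with explicit connect and read timeouts and returns the JSON body. */
    static String fetchJson(String lang, String title, int timeoutMillis) throws IOException {
        HttpURLConnection conn = (HttpURLConnection)
                URI.create(buildExtractUrl(lang, title)).toURL().openConnection();
        conn.setConnectTimeout(timeoutMillis);
        conn.setReadTimeout(timeoutMillis);
        // Placeholder value: replace with a descriptive agent and contact details.
        conn.setRequestProperty("User-Agent", "Tex2TaxExample/0.1 (placeholder contact)");
        try (InputStream in = conn.getInputStream()) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[8192];
            int n;
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        } finally {
            conn.disconnect();
        }
    }
}
```

A call such as `WikipediaExtractClient.fetchJson("en", "Alan Turing", 10000)` would make one small JSON request per word instead of downloading and parsing a full HTML page.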

Freiheit