
I am working on scraping some data from a specific web page:

http://www.studenti.ict.uniba.it/esse3/ListaAppelliOfferta.do

The data I need to scrape is shown in a table that appears as the result of a search: you select one "Facoltà" and one "Dipartimento", then click "Avvia Ricerca".

I am very glad to say I was able to scrape 100% of the data in the table using JSoup, but in order to do so I need the HTML source code of the page containing the table.

The only way I have been able to get that HTML is by manually selecting one "Facoltà" and one "Dipartimento" and then clicking "Avvia Ricerca". The table is then shown, and I can obtain the HTML of the whole page by right-clicking and saving the page source.

I want to write some Java code that automates these steps, given the URL above:

  1. selecting "Dipartimento di Informatica" among the "Facoltà" options
  2. selecting "Informatica" (or one of the others available)
  3. clicking "Avvia Ricerca"
  4. saving the HTML source of the resulting page to an .html file

Then I can apply the code I have already written to scrape the data from the table.

Is there a library or similar tool that can help me? I am sure there is no need to reinvent the wheel here.

Please note that I tried the following code:

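import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;

try {
  URL url = new URL("http://www.studenti.ict.uniba.it/esse3/ListaAppelliOfferta.do");
  // try-with-resources closes the reader even if reading fails
  try (BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream()))) {
    String line;
    while ((line = in.readLine()) != null) {
      System.out.println(line);
    }
  }
} catch (IOException e) {
  // also covers MalformedURLException, which is a subclass of IOException
  e.printStackTrace();
}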

But this way I only get the HTML of the initial page, which does not yet contain the table I need to scrape.
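A plain GET returns only the empty search form, because the table is produced when the form is submitted. The following is a minimal sketch of one possible approach, assuming the search is an ordinary form POST to the same URL and that no session cookie or JavaScript is required (neither has been verified for this site). The field names `fac_id` and `cds_id` and their values are hypothetical placeholders; the real ones would have to be read from the form's HTML or from the browser's network inspector. It uses only the JDK (Java 11+).

```java
import java.io.IOException;
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

final class FormFetcher {

    static final String SEARCH_URL =
            "http://www.studenti.ict.uniba.it/esse3/ListaAppelliOfferta.do";

    /** Encodes form fields as application/x-www-form-urlencoded. */
    static String encodeForm(Map<String, String> fields) {
        return fields.entrySet().stream()
                .map(e -> URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8)
                        + "=" + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8))
                .collect(Collectors.joining("&"));
    }

    /** Sends the fields as a form POST and returns the response body. */
    static String postForm(String url, Map<String, String> fields)
            throws IOException, InterruptedException {
        HttpRequest request = HttpRequest.newBuilder(URI.create(url))
                .header("Content-Type", "application/x-www-form-urlencoded")
                .POST(HttpRequest.BodyPublishers.ofString(encodeForm(fields)))
                .build();
        HttpClient client = HttpClient.newHttpClient();
        return client.send(request, HttpResponse.BodyHandlers.ofString()).body();
    }

    public static void main(String[] args) throws IOException, InterruptedException {
        // Hypothetical field names and values: replace with the ones the real form uses.
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("fac_id", "1"); // assumed: the selected "Facoltà"
        fields.put("cds_id", "2"); // assumed: the selected "Dipartimento"

        if (args.length == 0) {
            // Dry run: print the request body that would be sent, without any network access.
            System.out.println(encodeForm(fields));
            return;
        }
        // Submit the form and save the returned page to the file named in args[0].
        String html = postForm(SEARCH_URL, fields);
        Files.writeString(Path.of(args[0]), html);
    }
}
```

Running it with a file name such as `appelli.html` as the only argument would save the response there, ready to be parsed with the existing JSoup code. If the site turns out to depend on a session cookie or on JavaScript, this sketch will not be enough on its own.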

NoobNe0
  • If you can come up with a single HTTP request that would retrieve the desired web page in the browser, you could use [this](http://stackoverflow.com/a/2587000/3388491) approach. – oschlueter Mar 15 '14 at 12:54
  • You say you use JSoup; why don't you use it to slurp the URL to start with? – fge Mar 15 '14 at 13:49
  • Can you please explain how I can do that? – NoobNe0 Mar 16 '14 at 18:25

0 Answers