I'm trying to retrieve some data from a website.

I wrote a Java class that works fine with many sites, but it doesn't work with this particular site, which uses extensive JavaScript in the input form.

As you can see from the code, I took the input field names from the HTML source, but maybe this website doesn't accept POST requests of this kind?

How can I simulate user interaction to retrieve the generated HTML?

package com.transport.urlRetriver;

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileWriter;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.util.ArrayList;

import org.apache.http.HttpEntity;
import org.apache.http.HttpResponse;
import org.apache.http.NameValuePair;
import org.apache.http.client.entity.UrlEncodedFormEntity;
import org.apache.http.client.methods.HttpPost;
import org.apache.http.impl.client.DefaultHttpClient;
import org.apache.http.message.BasicNameValuePair;

public class UrlRetriver {


    String stationPoller (String url, ArrayList<NameValuePair> params) {

        HttpPost postRequest;
        HttpResponse response;
        HttpEntity entity;
        String result = null;

        DefaultHttpClient httpClient = new DefaultHttpClient();


        try {

            postRequest = new HttpPost(url);

            postRequest.setEntity(new UrlEncodedFormEntity(params));
            response = httpClient.execute(postRequest);

            entity = response.getEntity();

            if(entity != null){
              InputStream inputStream = entity.getContent();
              result = convertStreamToString(inputStream);
            }



        } catch (Exception e) {

            // Log the cause instead of silently discarding it
            e.printStackTrace();
            result = "We had a problem";

        } finally {

            httpClient.getConnectionManager().shutdown();

        }



        return result;

    }





    void ATMtravelPoller () {




        ArrayList<NameValuePair> params = new ArrayList<NameValuePair>(3);

        String url = "http://www.atm-mi.it/it/Pagine/default.aspx";

        params.add(new BasicNameValuePair("ctl00$SPWebPartManager1$g_afa5adbb_5b60_4e50_8da2_212a1d36e49c$txt_address_s", "Viale romagna 1"));

        params.add(new BasicNameValuePair("ctl00$SPWebPartManager1$g_afa5adbb_5b60_4e50_8da2_212a1d36e49c$txt_address_e", "Viale Toscana 20"));

        params.add(new BasicNameValuePair("sf_method", "POST"));

        String result = stationPoller(url, params);

        saveToFile(result, "/home/rachele/Documents/atm/out4.html");

    }

    static void saveToFile(String toFile, String pos) {
        try {
            // Create file
            FileWriter fstream = new FileWriter(pos);
            BufferedWriter out = new BufferedWriter(fstream);
            out.write(toFile);
            // Close the output stream
            out.close();
        } catch (Exception e) { // Catch exception if any
            System.err.println("Error: " + e.getMessage());
        }
    }

    private static String convertStreamToString(InputStream is) {
          BufferedReader reader = new BufferedReader(new InputStreamReader(is));
          StringBuilder stringBuilder = new StringBuilder();

          String line = null;
          try {
            while ((line = reader.readLine()) != null) {
              stringBuilder.append(line + "\n");
            }
          } catch (IOException e) {
            e.printStackTrace();
          } finally {
            try {
              is.close();
            } catch (IOException e) {
              e.printStackTrace();
            }
          }
          return stringBuilder.toString();
    }

}
Mascarpone
  • This is not an answer but rather a description of what happens. There are about 30 parameters you need to submit, and some parameter names/values are dynamically generated to prevent scripts or programs from fetching the content. You hard-coded the parameter names, which change every time you GET the content, so those parameters will not be the same. – gigadot Jan 23 '11 at 19:11
  • Not an answer for your JavaScript issue (hence the comment), but note that for a lot of sites you'll need to fake your "user agent" from Java, otherwise you won't get the real site. Been there, done that, you **must** fake the user agent ;) – SyntaxT3rr0r Jan 23 '11 at 19:13
  • For this website, it makes no difference whether you send a user agent or not. I tested it by filtering out the User-Agent header from my Firefox, and the results are no different. – gigadot Jan 23 '11 at 19:32
  • @gigadot, do you think there might be a workaround for this? – Mascarpone Jan 23 '11 at 20:20
  • If I were you, I would first do an HTTP GET for the content of the page. Then get all input tags (including hidden ones) within the form tag and build a list of all name/value pairs for the HTTP POST. Ignore the JavaScript bits for now. If that doesn't work, we'll have to rethink it. – gigadot Jan 23 '11 at 21:06
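
The "get all input tags" step from the last comment could be sketched roughly as follows. This is a minimal illustration using only the JDK: the class and method names (`FormFieldExtractor`, `extractInputs`) are made up for this example, and the regex approach is a simplification that assumes quoted attribute values with no `>` inside them. It also ignores `<select>` and `<textarea>` elements, does not decode HTML entities, and does not filter out unchecked checkboxes or buttons the way a browser would. A real HTML parser such as jsoup would be more robust.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FormFieldExtractor {

    // Matches a whole <input ...> tag (assumes no '>' inside attribute values).
    private static final Pattern INPUT_TAG =
            Pattern.compile("<input\\b[^>]*>", Pattern.CASE_INSENSITIVE);

    /** Returns name -> value for every named input tag, in document order. */
    public static Map<String, String> extractInputs(String html) {
        Map<String, String> fields = new LinkedHashMap<>();
        Matcher tags = INPUT_TAG.matcher(html);
        while (tags.find()) {
            String tag = tags.group();
            String name = attribute(tag, "name");
            if (name == null) {
                continue; // inputs without a name are never submitted
            }
            String value = attribute(tag, "value");
            fields.put(name, value == null ? "" : value);
        }
        return fields;
    }

    // Reads one double- or single-quoted attribute from a tag, or null if absent.
    private static String attribute(String tag, String attr) {
        Pattern p = Pattern.compile(
                "(?<![\\w-])" + attr + "\\s*=\\s*(?:\"([^\"]*)\"|'([^']*)')",
                Pattern.CASE_INSENSITIVE);
        Matcher m = p.matcher(tag);
        if (!m.find()) {
            return null;
        }
        return m.group(1) != null ? m.group(1) : m.group(2);
    }
}
```

Calling `FormFieldExtractor.extractInputs(html)` on the body returned by the GET gives the name/value pairs currently in the page, including hidden fields, so the generated control names do not have to be hard-coded.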

1 Answer

In my view, there could be a JavaScript-generated field with a dynamic value intended to prevent automated code from crawling the site. Please post the specific site you want to download.

michal.kreuzman
  • I already included it in the original description: http://www.atm-mi.it/en/Pages/default.aspx – Mascarpone Jan 23 '11 at 20:18
  • As gigadot wrote above, you have to do a GET request to get the hidden fields (as far as I can see, __REQUESTDIGEST is causing the problem) and then make a POST request. In general, act like a user in a browser. – michal.kreuzman Jan 24 '11 at 09:41
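
A rough sketch of the GET-then-POST flow described in the comment above, written against the JDK's `java.net.http` client (Java 11+) rather than the Apache HttpClient used in the question. All class and method names here (`FormReplay`, `newSessionClient`, `merge`, and so on) are hypothetical. The sketch assumes the site ties its hidden fields to a cookie-based session and that replaying the scraped fields is enough; whether this particular site also needs values computed by JavaScript is not something this code can confirm.

```java
import java.io.IOException;
import java.net.CookieManager;
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

public class FormReplay {

    /** A client that keeps cookies, so the GET and the POST share one session. */
    public static HttpClient newSessionClient() {
        return HttpClient.newBuilder()
                .cookieHandler(new CookieManager())
                .followRedirects(HttpClient.Redirect.NORMAL)
                .build();
    }

    /** Step 1: fetch the page that contains the form and its hidden fields. */
    public static String get(HttpClient client, String url)
            throws IOException, InterruptedException {
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
        return client.send(request, HttpResponse.BodyHandlers.ofString()).body();
    }

    /** Step 2: start from the scraped fields and overwrite the ones the user fills in. */
    public static Map<String, String> merge(Map<String, String> scraped,
                                            Map<String, String> user) {
        Map<String, String> all = new LinkedHashMap<>(scraped);
        all.putAll(user);
        return all;
    }

    /** Encodes fields as an application/x-www-form-urlencoded request body. */
    public static String encodeForm(Map<String, String> fields) {
        StringJoiner body = new StringJoiner("&");
        for (Map.Entry<String, String> e : fields.entrySet()) {
            body.add(URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8)
                    + "=" + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8));
        }
        return body.toString();
    }

    /** Step 3: send every field back, using the same client (and cookies) as the GET. */
    public static String post(HttpClient client, String url, Map<String, String> fields)
            throws IOException, InterruptedException {
        HttpRequest request = HttpRequest.newBuilder(URI.create(url))
                .header("Content-Type", "application/x-www-form-urlencoded")
                .POST(HttpRequest.BodyPublishers.ofString(encodeForm(fields)))
                .build();
        return client.send(request, HttpResponse.BodyHandlers.ofString()).body();
    }
}
```

The intended use is to call `get` once, parse the hidden inputs (such as `__REQUESTDIGEST`) out of the returned HTML by whatever means you prefer, `merge` them with the two address fields, and pass the result to `post` with the same client. The address field names should be read from the freshly fetched page rather than hard-coded, since the comments above report that they are regenerated on each request.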