
I am trying to get the HTML source of this page: http://www.phila.gov/water/swmap/Parcel.aspx?parcel_id=544393

using the following code:

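import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayList;
import java.util.Scanner;

public class URLGetter {
    private URL url;
    private HttpURLConnection connection;

    public URLGetter(String url) throws MalformedURLException {
        // A malformed URL now propagates to the caller instead of being swallowed
        this.url = new URL(url);
        try {
            connection = (HttpURLConnection) this.url.openConnection();
        } catch (IOException e) {
            // Fail loudly rather than leaving connection null
            throw new UncheckedIOException("Could not open connection to " + url, e);
        }
    }

    public ArrayList<String> getContents() {
        ArrayList<String> contents = new ArrayList<String>();
        // try-with-resources closes the Scanner and the underlying stream
        try (Scanner in = new Scanner(connection.getInputStream())) {
            while (in.hasNextLine()) {
                contents.add(in.nextLine());
            }
        } catch (IOException e) {
            // Report the failure instead of silently returning an empty list
            System.err.println("Could not read " + url + ": " + e);
        }
        return contents;
    }

}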

I test it with this very simple class:

import java.net.MalformedURLException;
import java.util.ArrayList;

public class URLTester {

    public static void main(String[] args) throws MalformedURLException {
        URLGetter get = new URLGetter("http://www.phila.gov/water/swmap/Parcel.aspx?parcel_id=544393/print/textversion.html");
        ArrayList<String> list = get.getContents();

        // Print the fetched page line by line
        for (String s : list) {
            System.out.println(s);
        }
    }

}

I print the HTML. Everything comes through fine except the data in the tables (inside the <td> tags). Instead of the values that should appear, for example PERUTO ANGELO CHARLES III, every single cell contains only &nbsp;.

I really don't know why it does this. Looking over the text that I get back, nothing else is wrong.

edit: I've used this code on other websites and have always been able to get the information I need. From this site I get everything I need except the table values.

Sam Bobel
  • Sorry, can you explain a bit more? Are you able to get the content, or do you want to display it in a specific format? Please confirm. – Braj May 03 '14 at 07:31
  • Where is `HtmlCleaner cleaner = new HtmlCleaner();` used? – Braj May 03 '14 at 07:33
  • Sorry, the HtmlCleaner line was an artifact; I was trying something else out before and forgot to delete it. I am able to get the content. Running that code gives me nearly the full HTML text of the document. The only discrepancy between what the website has and what I see is that all the table data entries are changed to whitespace. – Sam Bobel May 03 '14 at 07:38

1 Answer


Do you want to replace the HTML entities with their equivalent characters? If so, try the code below, which uses StringEscapeUtils#unescapeHtml():

... 
while (in.hasNextLine()) {
    contents.add(org.apache.commons.lang.StringEscapeUtils.unescapeHtml(in.nextLine()));
}
...

Note: this requires the Apache Commons Lang jar.

For more info, have a look at "Replace HTML codes with equivalent characters in Java".
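
To see what this kind of unescaping does to a cell that contains only `&nbsp;`, here is a minimal sketch that does not need the Commons Lang jar. `EntityDemo` and `unescapeBasic` are hypothetical names, and the four-entry table is a hand-rolled stand-in for illustration only; it is not the implementation of StringEscapeUtils#unescapeHtml(), which covers many more entities.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class EntityDemo {
    // Hypothetical, deliberately tiny entity table (assumption: four entities are enough for the demo).
    private static final Map<String, String> ENTITIES = new LinkedHashMap<>();
    static {
        ENTITIES.put("&nbsp;", "\u00A0"); // non-breaking space
        ENTITIES.put("&lt;", "<");
        ENTITIES.put("&gt;", ">");
        ENTITIES.put("&amp;", "&"); // replaced last so "&amp;lt;" becomes "&lt;", not "<"
    }

    // Replaces each known entity with its character, in insertion order.
    static String unescapeBasic(String s) {
        String out = s;
        for (Map.Entry<String, String> e : ENTITIES.entrySet()) {
            out = out.replace(e.getKey(), e.getValue());
        }
        return out;
    }

    public static void main(String[] args) {
        String cell = unescapeBasic("<td>&nbsp;</td>");
        System.out.println(cell.length());        // 10
        System.out.println((int) cell.charAt(4)); // 160, the code point of U+00A0
    }
}
```

The entity becomes a single non-breaking space, so a cell that held only `&nbsp;` in the fetched source still prints as blank after unescaping.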

Braj
  • I think that the problem is in getInputStream somehow, or maybe in opening the connection. Somehow the correct code isn't going from the website to my computer. I'll add this up there too, but I've used this setup on other websites and the code has always been fine. And most of the code from this one is, too. Just not the table data. – Sam Bobel May 03 '14 at 07:49
  • Unable to access http://www.phila.gov/water/swmap/Parcel.aspx?parcel_id=544393/print/textversion.html – Braj May 03 '14 at 08:01
  • Hi Braj, the result is the same if I use the link http://www.phila.gov/water/swmap/Parcel.aspx?parcel_id=544393/ Can you access that? – Sam Bobel May 03 '14 at 08:05
  • Hi Braj, thanks a lot for the help. I found out what the problem is; not the solution, but the problem. It turns out I am retrieving the page source just fine. The source I get is the HTML as it stands before any JavaScript runs, though, and the values I want are filled in dynamically by a JavaScript method when the page loads. So I now need to figure out a way to run the JavaScript before reading the page in. – Sam Bobel May 03 '14 at 15:47
  • @user3598519 either update this post or create a new post to make it clear for others. – Braj May 03 '14 at 15:51
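
The diagnosis in the comments (cells that arrive as `&nbsp;` placeholders and are only filled in later by script) can be checked against the fetched lines with a rough sketch like the one below. `PlaceholderCheck` and `countPlaceholderCells` are hypothetical names, and the sample markup is invented because the real page source is not shown in this thread. A regular expression is only a quick heuristic here, not an HTML parser.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PlaceholderCheck {
    // Matches a <td ...> cell whose body is empty or holds only whitespace and &nbsp; entities.
    private static final Pattern EMPTY_CELL =
            Pattern.compile("<td\\b[^>]*>(?:\\s|&nbsp;?)*</td>", Pattern.CASE_INSENSITIVE);

    // Counts placeholder cells in the lines returned by something like getContents().
    static int countPlaceholderCells(List<String> lines) {
        String html = String.join("\n", lines); // rejoin so cells spanning lines still match
        Matcher m = EMPTY_CELL.matcher(html);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // Invented sample markup (assumption: the real page looks roughly like this).
        List<String> sample = Arrays.asList(
                "<table><tr>",
                "<td id=\"owner\">&nbsp;</td>",
                "<td>Parcel</td>",
                "</tr></table>");
        System.out.println(countPlaceholderCells(sample)); // 1
    }
}
```

A high count on the real page would be consistent with the values being injected after load. Actually obtaining them would need something that runs the page's script, which this sketch does not attempt.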