
I'm trying to download the HTML of www.pandora.com/profile/stations/olin_d_kirkland with Java so that it matches what I get when I select 'View page source' from the page's context menu in Chrome.

I know how to download a webpage's HTML source with Java. I have done it with downloads.nl and tested it on other sites. Pandora, however, is a mystery. My ultimate goal is to parse the 'Stations' from a Pandora account.

Specifically, I would like to grab the Station names from a site such as www.pandora.com/profile/stations/olin_d_kirkland

I have tried the Selenium library and Java's built-in URL handling, but I only get ~4700 lines of HTML when I should be getting ~5300. On top of that, the result contains none of the personalized data I'm looking for.

I figured the cause was that I wasn't letting the page's JavaScript execute first, but even when my code waited for the page to load, I always got the same result.

If at all possible, I'd like to end up with a method called 'grabPageSource()' that returns the page source as a String when called.


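import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PandoraStationFinder {
    // Matches a line consisting solely of a station link,
    // e.g. <a href="/station/123">Some Name</a>. Compiled once, not per line.
    private static final Pattern STATION_LINK =
            Pattern.compile("<a href=\"/station/\\d+\">[\\w\\s]+</a>");

    public static void main(String[] args) throws IOException {
        String s = grabPageSource();
        // Split on \n or \r\n ("\n\r" would almost never match a real line break).
        String[] lines = s.split("\\r?\\n");
        // Station is a separate class, not shown here.
        List<Station> stations = new ArrayList<>();
        for (int i = 0; i < lines.length; i++) {
            String t = lines[i].trim();
            Matcher m = STATION_LINK.matcher(t);
            if (m.matches()) {
                stations.add(new Station(t));
                // System.out.println("I found a match on line " + i + ".");
                // System.out.println(t);
            }
        }
    }

    public static String grabPageSource() throws IOException {
        String fullTxt = "";
        // Get HTML from www.pandora.com/profile/stations/olin_d_kirkland
        return fullTxt;
    }
}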

How it's done doesn't matter, but in the final product I'd like to grab a comprehensive list of ALL songs that a user has liked on Pandora.

Peter O.
Olin Kirkland
  • http://stackoverflow.com/questions/238547/how-do-you-programmatically-download-a-webpage-in-java – Joel Jul 24 '12 at 15:16
  • I already scoured the web, including stackoverflow. I didn't just post a question without trying every other solution posted. That link was useful, but does not solve the problem I have with Pandora. – Olin Kirkland Jul 24 '12 at 15:20
  • Alright, good. However, there are several people who do not do that before posting, and since you did not say you tried some of the options on that page, I thought I would post it... – Joel Jul 24 '12 at 15:26
  • When I view that page in Chrome, there are only ~4700 lines of code. Is it possible that being logged into Pandora is causing the additional lines? Perhaps it has something to do with not authenticating yourself in your program. – Joel Jul 24 '12 at 15:29
  • I just used Jsoup, and it returned 2393 lines of HTML. That's about 3000 fewer than I was expecting. I'm not worried though, since I know that Stackoverflow ultimately solves every problem in the known universe :D – Olin Kirkland Jul 24 '12 at 15:31
  • Hmmm. You may be onto something. Do you think it has to do with authentication? I will try accessing other Pandora accounts in my browser, though I think it shouldn't be a problem since the profiles are public. (By authentication, you mean that I need to be logged in with Pandora, right?) – Olin Kirkland Jul 24 '12 at 15:32
  • Yes, it did. I don't think it is authentication, or perhaps Pandora looks to see if I'm using a browser (though I don't see that making a difference)? – Olin Kirkland Jul 24 '12 at 15:35
  • I tried opening the source for http://www.pandora.com/profile/stations/doughboy49 and got the right amount, so I don't think it's authentication (I opened the source in my browser, not using Java) – Olin Kirkland Jul 24 '12 at 15:35
  • Yes, I wonder if additional items would be included in the html if you are logged in (have authenticated yourself). Also, are line lengths standard? Is the line length in a web browser window the same as in your program? – Joel Jul 24 '12 at 15:35
  • I checked being logged in and logged out (in the browser) and it made no difference. I do not think the line length would matter, since I copy the source to Notepad++ and ctrl+F through it to look for words that are personalized (like "classical medley" or "U2")- however I come up blank every time. – Olin Kirkland Jul 24 '12 at 15:39
  • Be sure to check the original question to see posted code. – Olin Kirkland Jul 24 '12 at 15:43
  • Well, I guess I would recommend stepping through with the debugger and just watching what happens. If you can't find those personalized words you are looking for, your problem still lies with retrieving the html. – Joel Jul 24 '12 at 15:48
  • Yep. The problem still lies with retrieving the html. *Sigh* – Olin Kirkland Jul 24 '12 at 15:57
  • Pandora does not have an API, which is why I'm trying to parse the HTML. – Olin Kirkland Jul 24 '12 at 18:40
  • OK, I still do not have an answer. – Olin Kirkland Jul 26 '12 at 17:43
  • Have you considered using the unofficial API? It's not 'officially' supported, but is used by quite a lot of people so is probably more reliable than rolling your own? http://pan-do-ra-api.wikia.com/wiki/Pan-do-ra_API_Wiki – Erica Oct 23 '12 at 05:06

2 Answers


Pandora's pages are built largely with Ajax, so many scrapers struggle with them: the initial HTML does not contain the personalized data. In the case you've shown above, the list of stations, the page actually makes a secondary request to:

http://www.pandora.com/content/stations?startIndex=0&webname=olin_d_kirkland

If you run your request, but point it to that URL rather than the main site, I think you will have a lot more luck with your scraping.
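As a rough illustration, here is a minimal sketch of a `grabPageSource` that requests that content URL directly. It assumes the endpoint answers a plain GET; the `StationPageFetcher` class name, the `stationsUrl` helper, the User-Agent header and the timeouts are my own choices for the example, not anything Pandora documents. The `webname` is assumed to be URL-safe already.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.util.stream.Collectors;

class StationPageFetcher {
    // Builds the secondary-request URL quoted above (hypothetical helper).
    // Assumes webname needs no URL encoding.
    static String stationsUrl(String webname, int startIndex) {
        return "http://www.pandora.com/content/stations?startIndex=" + startIndex
                + "&webname=" + webname;
    }

    // Downloads the response body of the given URL and returns it as a String.
    static String grabPageSource(String url) throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) URI.create(url).toURL().openConnection();
        // Assumption: a browser-like User-Agent may be needed; adjust or drop it.
        conn.setRequestProperty("User-Agent", "Mozilla/5.0");
        conn.setConnectTimeout(10_000);
        conn.setReadTimeout(10_000);
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
            return reader.lines().collect(Collectors.joining("\n"));
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(grabPageSource(stationsUrl("olin_d_kirkland", 0)));
    }
}
```

The returned fragment can then be fed to whatever line-by-line matching you already have.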

Similarly, to access the "likes", you want this URL: http://www.pandora.com/content/tracklikes?likeStartIndex=0&thumbStartIndex=0&webname=olin_d_kirkland

This will pull back the liked tracks in groups of 5, but you can page through the results by increasing the 'thumbStartIndex' parameter.
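The paging could be sketched roughly as follows. The `TrackLikesPager` class, the `fetch` function you pass in (any "URL in, body out" downloader), the `maxPages` cap and the "stop at the first empty response" rule are all assumptions for illustration; I have not verified how the endpoint signals the last page.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

class TrackLikesPager {
    // Liked tracks come back in groups of 5, as described above.
    static final int PAGE_SIZE = 5;

    // Builds the tracklikes URL for one page (hypothetical helper).
    // Assumes webname needs no URL encoding.
    static String trackLikesUrl(String webname, int thumbStartIndex) {
        return "http://www.pandora.com/content/tracklikes?likeStartIndex=0"
                + "&thumbStartIndex=" + thumbStartIndex
                + "&webname=" + webname;
    }

    // Requests page after page, increasing thumbStartIndex by PAGE_SIZE each time.
    // Assumption: an empty (or null) body means there are no more likes.
    static List<String> fetchAllPages(String webname,
                                      Function<String, String> fetch,
                                      int maxPages) {
        List<String> pages = new ArrayList<>();
        for (int i = 0; i < maxPages; i++) {
            String body = fetch.apply(trackLikesUrl(webname, i * PAGE_SIZE));
            if (body == null || body.trim().isEmpty()) {
                break;
            }
            pages.add(body);
        }
        return pages;
    }
}
```

Passing the downloader in as a function keeps the paging logic separate from the HTTP code, so you can try it against canned responses before pointing it at the real site.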

Erica

Not an answer exactly, but hopefully this will get you moving in the correct direction:

Whenever I get into this sort of thing, I always fall back on an HTTP monitoring tool. I use Firefox, and I really like the Live HTTP Headers extension. Check out what headers are going back and forth, then tailor your HTTP requests accordingly. As an absolute lowest-level test, grab the headers from a successful request, then send them to port 80 using telnet and see what comes back.
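If telnet isn't handy, the same low-level test can be approximated from Java with a plain socket. This is only a sketch: the `RawHttpProbe` class and its methods are made up for illustration, and the header set is a bare minimum; in practice you would paste in the headers you captured from a successful browser request.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

class RawHttpProbe {
    // Builds a minimal HTTP/1.1 GET request; every line ends in CRLF and a
    // blank line terminates the headers. Replace the headers with captured ones.
    static String buildRequest(String host, String path) {
        return "GET " + path + " HTTP/1.1\r\n"
                + "Host: " + host + "\r\n"
                + "User-Agent: Mozilla/5.0\r\n"
                + "Connection: close\r\n"
                + "\r\n";
    }

    // Sends the raw request text and returns everything the server sends back,
    // status line and response headers included.
    static String send(String host, int port, String request) throws IOException {
        try (Socket socket = new Socket(host, port)) {
            OutputStream out = socket.getOutputStream();
            out.write(request.getBytes(StandardCharsets.US_ASCII));
            out.flush();
            InputStream in = socket.getInputStream();
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            byte[] chunk = new byte[4096];
            int n;
            while ((n = in.read(chunk)) != -1) {
                buf.write(chunk, 0, n);
            }
            return new String(buf.toByteArray(), StandardCharsets.ISO_8859_1);
        }
    }

    public static void main(String[] args) throws IOException {
        String host = "www.pandora.com";
        String path = "/profile/stations/olin_d_kirkland";
        System.out.println(send(host, 80, buildRequest(host, path)));
    }
}
```

Seeing the status line and response headers printed next to the body makes it easier to spot redirects or cookies that a higher-level library handles silently.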

Kevin Day
  • Can you go into more detail? What you said sounds a lot like jargon to me, probably because I am new to using code to interact with the web. – Olin Kirkland Aug 07 '12 at 15:30
  • Sorry Olin - this is pretty basic HTTP stuff, and I'm just not sure that this is the venue to get into it... Did you try loading the extension and taking a look at how the headers go back and forth? If you are really going to dig into this, it would be worth tracking down some tutorials on how HTTP works behind the scenes. Until you've interacted with a web server using telnet, you really won't "Get" how it works (pun intended). – Kevin Day Aug 07 '12 at 17:26
  • :| I suppose I'll focus on this later. I've been a little busy and figure I will get back to the pandora project. – Olin Kirkland Aug 09 '12 at 19:56