
I have tried PhantomJS, Cheerio in Node, and the WebBrowser control in C# to get my song list. I can fetch the HTML successfully, but the song list is missing from it, and I can't figure out why.

The only approach that works is copying the HTML from the browser's dev tools and analyzing it with jQuery.

Here is my code in WinForms:

    private void Form1_Load(object sender, EventArgs e)
    {
        webBrowser1.Navigate("http://grooveshark.com/#!/shinningstar1001/collection");
        webBrowser1.DocumentCompleted += webBrowser1_DocumentCompleted;
    }

    void webBrowser1_DocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)
    {
        File.WriteAllText("D://test.txt", webBrowser1.DocumentText);
    }

In Cheerio :

var cheerio = require('cheerio');
var request = require('request');

var url = 'http://grooveshark.com/#!/shinningstar1001/collection';

request({
    url: url,
    headers: {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
}, function (err, resp, body) {
    if (err) { return console.error(err); }
    var $ = cheerio.load(body);
    console.log(body);
});

I guess this is because I can't get the full document after the AJAX load?

But why doesn't the WebBrowser control work either? I can see that the full content is loaded in the control. Any advice would be really appreciated.

I've tried @Murray Foxcroft's solution, but I still can't get the exact HTML I want (screenshot not included).

Additional question

With @Murray Foxcroft's solution I can get about 8% of the list content, but why can't I get the full song list that is piped into the page? For example, I can get the song "Set Me Free", which is around 40th in the list, but not "This Love", which is around 70th. (Both songs are definitely on the site.)

    if (webBrowser1.ReadyState != WebBrowserReadyState.Complete)
        return;
    if (richTextBox1.Text.Length > 0) return;
    var songList = webBrowser1.Document.GetElementById("profile-grid");

    // Looking for "This Love": execution never enters this branch.
    if (songList != null && songList.InnerHtml.Contains("This Love")){...}

    // "Set Me Free" is found:
    if (songList != null && songList.InnerHtml.Contains("Set Me Free"))
    {
        richTextBox1.Text = songList.OuterHtml;
    }
Sing

1 Answer


For the WebBrowser sample, does the event actually fire?

Try attaching the event handler before calling Navigate, i.e. swap the two lines as follows:

webBrowser1.DocumentCompleted += webBrowser1_DocumentCompleted;

webBrowser1.Navigate("http://grooveshark.com/#!/shinningstar1001/collection");

Also, DocumentCompleted may fire more than once for a single page (for example, once for each frame or iframe it contains), so make sure you are handling the event for the URL you are after.

void BrowserDocumentCompleted(object sender,
        WebBrowserDocumentCompletedEventArgs e)
{
  if (e.Url.AbsolutePath != (sender as WebBrowser).Url.AbsolutePath)
    return; 

  //The page is finished loading 
}

Further details here: Detect WebBrowser complete page loading

Final solution: the content is piped into the main page from another source, so checking for the target div is about the best approach:

    private void webBrowser1_DocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)
    {
        // If the ReadyState is Complete, the page or an iframe within it has finished downloading.
        if (webBrowser1.ReadyState != WebBrowserReadyState.Complete)
            return;

        // Ensures only the first match of page-content is returned to the RichTextBox.
        // If this does not contain what you are looking for, you may need to find an
        // additional way to narrow down the content you are after.
        if (richTextBox1.Text.Length > 0) return;

        // Check whether the page-content div is present in the loaded document
        // and fill the RichTextBox if it is.
        var songList = webBrowser1.Document.GetElementById("page-content");
        if (songList != null)
        {
            richTextBox1.Text = songList.OuterHtml;
        }
    }
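As a side note on the Cheerio attempt in the question: everything after `#` in a URL is a fragment, and fragments are handled by the browser rather than sent to the server. A plain HTTP client therefore most likely receives only the site's shell page, and the song list is filled in later by scripts that never run outside a browser. The short sketch below only inspects the URL from the question using Node's built-in `URL` class; the variable names are illustrative.

```javascript
// Split the question's URL into the part a server sees and the part it does not.
var parsed = new URL('http://grooveshark.com/#!/shinningstar1001/collection');

// The fragment (hash) stays on the client; only scripts in the page can read it.
var fragment = parsed.hash;

// What an HTTP client actually requests: origin + path + query string.
var requested = parsed.origin + parsed.pathname + parsed.search;

console.log('requested:', requested);
console.log('client-side only:', fragment);
```

This is why a script-capable host (the WebBrowser control, or a headless browser) is needed before the target div contains any songs.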
Murray Foxcroft
  • Still can't get the exact content; please refer to the picture :( – Sing Dec 30 '14 at 15:53
  • But if you use Chrome dev tools and browse the site, you can find that id with Ctrl+F and see the song list inside it. That is what I can't figure out. – Sing Dec 30 '14 at 16:15
  • Tested solution added to the answer - look for "Final solution" – Murray Foxcroft Dec 30 '14 at 21:18
  • Wow, thank you, this is amazing. However, I can only get about 1/10 of the song list. Please refer to the updated question :) – Sing Dec 31 '14 at 06:40
  • Hi Andy, you'll have to keep digging on this one. The answer will lie in the HTML, use Chrome Dev Tools to explore further and get at the right elements. – Murray Foxcroft Dec 31 '14 at 08:14
  • I found that the number of song elements in the DOM never changes; their content changes dynamically as I scroll. Do you have any idea how to get all of it? – Sing Jan 02 '15 at 16:58
  • Try scrolling the browser window: https://social.msdn.microsoft.com/Forums/windows/en-US/38d8f6b2-d9e1-4d16-9254-b3f153bb1f6a/programatically-scroll-the-c-browser-window – Murray Foxcroft Jan 03 '15 at 14:14
  • I've tried it, but I found that it is not the browser window that scrolls but a scrollable div inside it. I'll try to find a way to scroll that. Thanks for your help :) – Sing Jan 05 '15 at 02:47
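The last few comments describe a list that keeps a fixed number of DOM rows and swaps their content as an inner div scrolls. One possible way to collect every row is to step that div's `scrollTop` and accumulate whatever is rendered at each position. The sketch below is only an illustration under stated assumptions: `collectAllRows` and `readVisibleRows` are hypothetical names, row texts are assumed to be unique, and the list is simulated with a plain object so the logic can be run in Node. In a real page the rows usually re-render asynchronously, so each step would also need a short pause before reading.

```javascript
// Hypothetical helper: gather all rows of a recycled (virtualized) list.
// `container` needs scrollTop, scrollHeight and clientHeight (a DOM element has these).
// `readVisibleRows` is supplied by the caller and returns the currently rendered row texts.
function collectAllRows(container, readVisibleRows) {
  var seen = {};
  var ordered = [];
  // Move half a viewport at a time so consecutive reads overlap and no row is skipped.
  var step = Math.max(1, Math.floor(container.clientHeight / 2));
  var maxTop = Math.max(0, container.scrollHeight - container.clientHeight);
  for (var top = 0; ; top = Math.min(top + step, maxTop)) {
    container.scrollTop = top;
    readVisibleRows().forEach(function (row) {
      // Assumption: row text is unique, so it can serve as the de-duplication key.
      if (!Object.prototype.hasOwnProperty.call(seen, row)) {
        seen[row] = true;
        ordered.push(row);
      }
    });
    if (top >= maxTop) break;
  }
  return ordered;
}

// Simulated list: 70 rows of 20px each, shown through a 100px window (5 rows at a time).
var allSongs = [];
for (var i = 0; i < 70; i++) allSongs.push('Song ' + i);
var fakeDiv = { scrollTop: 0, clientHeight: 100, scrollHeight: 70 * 20 };
function visibleSongs() {
  var first = Math.floor(fakeDiv.scrollTop / 20);
  return allSongs.slice(first, first + 5);
}

var collected = collectAllRows(fakeDiv, visibleSongs);
console.log(collected.length); // 70
```

Inside the actual page, `container` would be the scrollable div and `readVisibleRows` would read the row elements under it; which elements those are has to be found with the dev tools, as suggested above.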