
I've been working on a way to scrape the data from a page and display it in roughly the same format as the source. I found YQL and I think it's brilliant, except I can't figure out how to display the whole output with nothing special beyond the basic formatting.

The YQL input code is:

select * from html where url="http://directory.vancouver.wsu.edu/anthropology" and xpath="//div[@id='facdir']"

Using that, it returns JSON at this URL:

http://query.yahooapis.com/v1/public/yql?q=select%20*%20from%20html%20where%20url%3D%22http%3A%2F%2Fdirectory.vancouver.wsu.edu%2Fanthropology%22%20and%20xpath%3D%22%2F%2Fdiv%5B%40id%3D'facdir'%5D%22&format=json&callback=anthropology

I've followed the Yahoo tutorials and created the news widget, among other things, but not one tutorial covered the basic view (I don't need the links either, just the paragraph setup).

Like this:

Name
Title
Phone:(###)###-####
Location: Building and Room #
email@vancouver.wsu.edu

Here is what I had for output, based on http://christianheilmann.com, but it doesn't do anything (apparently none of his tutorials work; I tried every one):

<html>
<head>
<script src="http://code.jquery.com/jquery-latest.js"></script>  
</head>
<body>
<p>
<b>Copied:</b>
</p>
<div>
<script>
function anthropology (0) {
// get the DIV with the ID $
var info = document.getElementById('facdir');
// add a class for styling
info.className = 'js';
// if it exists
if(info){
// get the info data returned from YQL
var data = o.query.results.span;
var link = info.getElementsByTagName('a')[0];
link.innerHTML = '(see all info)';
// to the main container DIV
var out = document.createElement('span');
out.className = 'info';
info.insertBefore(out,link.parentNode);
}
}
</script>
<script src='http://query.yahooapis.com/v1/public/yql?q=select%20*%20from%20html%20where%20url%3D%22http%3A%2F%2Fdirectory.vancouver.wsu.edu%2Fanthropology%22%20and%20xpath%3D%22%2F%2Fdiv%2Fdiv%2Fdiv%2Fdiv%2Fdiv%2Fdiv%2Fdiv%2Fdiv%2Fdiv%2Fdiv%2Fdiv%22&format=json&callback=anthropology'></script>
</div>
  • Also I have searched high and low on Google and here... I either don't know the terminology to find what I'm looking for or it has not been addressed yet. – hakarune Dec 27 '12 at 03:16
  • Do you have to do it specifically with javascript or can you do some server side stuff? I've had the best success with scraping sites using ruby with the nokogiri gem. http://nokogiri.org/ It lets you select things using CSS selectors just like you would with jQuery. – Justin Dec 27 '12 at 03:19
  • It's for a webapp/phone app. We are trying to mirror the data already on the site so as to not have to update 2+ things. We just want to update the site and have it mirror throughout the apps. – hakarune Dec 27 '12 at 06:45

1 Answer


I recently wrote a tutorial, with a couple of jsFiddles, for a different SO Question. It explains how to use YQL, XPATH, and jQuery .ajax(), and should shed some light on your problem. You can see that SO Answer here.

To give an acceptable answer to your question, I've put together a working demo that shows how easy it is to scrape the data from the webpage you're requesting.

The jsFiddle Demo contains lots of comments and console.log() messages to help you follow the workflow. Make sure your browser's console is open (Firebug, for example). The HTML and CSS used to construct the Faculty Member Boxes mimic those of the original website, including the links on the image, name, and email, as well as the webpage theme.

DEMO:

jsFiddle Data Scraping XML: Dynamic Webpage Building

Revised! In addition to the revised jsFiddle above, see the related:

jsFiddle Tutorial: Creating Dynamic Div's (Now Improved!)

HTML:

<div id="results"></div>

jQuery:

var directoryName = 'child-development-program';

$.ajax({
    type: 'GET',
    url: "http://query.yahooapis.com/v1/public/yql?q=select%20*%20from%20html%20where%20url%3D%22http%3A%2F%2Fdirectory.vancouver.wsu.edu%2F" + directoryName + "%22%20and%20xpath%3D%22%2F%2Fdiv%5B%40id%3D'content-inner'%5D%2Fdiv%2Fdiv%2Fdiv%2Fdiv%2Fdiv%5B2%5D%22",
    dataType: 'xml',
    success: function(data) {

        if (data) {

            // Show in console the jQuery Object.
            console.info('Here is the returned query');
            console.log( $(data).find('query') );

            // Show in console the results in inner-html text.
            var textResults = $(data).find('results').text();
            console.log( textResults );

            // Parse the list of faculty members. The index argument (indexFM) is available but not used here.
            $(data).find('results').find('.views-row').each(function(indexFM){

                // This variable will store the current faculty member.
                var facultyMember = this;
                console.info('Faculty jQuery DIV Object shown on next lines.');
                console.log( facultyMember );

                // Parse the contents of each faculty member. The index argument (indexFC) is available but not used here.
                $(facultyMember).each(function(indexFC){

                    // Get Thumbnail Image of Faculty Member
                    var facultyMemberImage = $(this).find('.views-field-field-profile-image-fid #directoryimage a img').attr('src');
                    console.log( facultyMemberImage );

                    // Get Title (Name) of Faculty Member
                    var facultyMemberTitle = $(this).find('.views-field-field-professional-title-value #largetitle').text();
                    console.log( facultyMemberTitle );
                    // Get relative URL fragment.

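                    //
                    // Stackoverflow Edit: Much more extraction in this section, see jsFiddle link above.
                    // (facultyMemberUrl, facultyMemberPosition, facultyMemberPhone, facultyMemberBuilding,
                    // facultyMemberRoom, and facultyMemberEmailUrl, used below, are presumably set in that omitted code.)
                    //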

                    // Get Email of Faculty Member
                    var facultyMemberEmail = $(this).find('.views-field-field-email-value span').text();

                    // Simple dashed line to separate faculty members as seen in browser console.
                    console.log('--------');

                    var divObject = '<div class="dynamicResults"><div class="dynamicThumb"><a href="' + facultyMemberUrl + '"><img src="' + facultyMemberImage + '" alt=""></a></div><div class="dynamicInfo"><div class="dynamicText"><a href="' + facultyMemberUrl + '" class="dynamicName">' + facultyMemberTitle + '</a></div><div class="dynamicText">' + facultyMemberPosition + '</div><div class="dynamicText">Phone: ' + facultyMemberPhone + '</div><div class="dynamicText">Location: ' + facultyMemberBuilding + ' <span>' + facultyMemberRoom + '</span></div><div class="dynamicText"><a href="' + facultyMemberEmailUrl + '" class="dynamicEmail">' + facultyMemberEmail + '</a><span class="dynamicEmailpic"></span></div></div></div><div class="clear"></div>';

                    // Build webpage with dynamic data.
                    $('#results').append( divObject );

                });

            });

        }
    }
});
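
The hand-encoded URL in the request above is hard to read and easy to break when the XPATH changes. As a minimal sketch (not part of the jsFiddle), the same URL can be assembled from a readable YQL statement with the standard encodeURIComponent() function. The helper name buildYqlUrl and its parameters are hypothetical; the endpoint and the query text are copied from the question.

```javascript
// Hypothetical helper: builds a YQL REST URL from a page URL and an XPATH.
// The endpoint and the "select * from html" statement come from the question;
// the function name and the optional "format" parameter are assumptions.
function buildYqlUrl(pageUrl, xpath, format) {
    var yql = 'select * from html where url="' + pageUrl +
              '" and xpath="' + xpath + '"';
    var url = 'http://query.yahooapis.com/v1/public/yql?q=' +
              encodeURIComponent(yql);
    // Leave the format out to get YQL's default XML response.
    if (format) {
        url += '&format=' + encodeURIComponent(format);
    }
    return url;
}

// Rebuilds the query from the question (without the callback parameter).
var anthropologyUrl = buildYqlUrl(
    'http://directory.vancouver.wsu.edu/anthropology',
    "//div[@id='facdir']",
    'json'
);
```

The result can be passed as the url option of $.ajax(), so only the readable XPATH string has to be edited when the page layout changes.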

Screenshot: Thumbnails in photo are 100px x 100px. Revised photo for the revised jsFiddle!


But on really looking at your Question, I wanted to try something new and simple, and the results are quite acceptable. This time, the data scraping technique uses the webpage's native CSS file as an asset in the jsFiddle and inserts the returned data directly into the DOM.

This method uses the same principle as above, except that it uses html as the .ajax() dataType to obtain a near clone of the original webpage. The only drawback is that the whole CSS file is required, but you can trim the original file to remove styles and selectors that aren't needed (important so as not to exceed the limit of 4,095 selectors per stylesheet in IE 9 and earlier).

DEMO:

jsFiddle Data Scraping HTML: Clone That Webpage

HTML:

<link type="text/css" rel="stylesheet" media="all" href="http://directory.vancouver.wsu.edu/sites/directory.vancouver.wsu.edu/files/css/css_f9f00e4e3fa0bf34a1cb2b226a5d8344.css" />

<div id="facultyAnthropology"></div>

jQuery:

var directoryName = 'anthropology';

$.ajax({
    type: 'GET',
    url: "http://query.yahooapis.com/v1/public/yql?q=select%20*%20from%20html%20where%20url%3D%22http%3A%2F%2Fdirectory.vancouver.wsu.edu%2F"+directoryName+"%22%20and%20xpath%3D%22%2F%2Fdiv%5B%40id%3D'content-area'%5D%22",
    dataType: 'html',
    success: function(data) {
        $('#facultyAnthropology').append($(data).find('results'));
    }
});

Screenshot: As above, Thumbnails in photo are 100px x 100px
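
For the plain layout asked for in the question (name, title, phone, location, email, with no links), the extracted values can also be joined as text instead of being wrapped in markup. The following is only a sketch under assumptions: the member object and its field names (name, title, phone, location, email) are hypothetical and would have to be filled from the values scraped in the first demo.

```javascript
// Collapse runs of whitespace left over from the scraped markup.
function clean(value) {
    return String(value || '').replace(/\s+/g, ' ').trim();
}

// Hypothetical input shape: { name, title, phone, location, email }.
// Returns one line per field, in the order shown in the question,
// and skips fields that are missing or empty.
function formatMember(member) {
    var phone = clean(member.phone);
    var location = clean(member.location);
    var lines = [
        clean(member.name),
        clean(member.title),
        phone ? 'Phone: ' + phone : '',
        location ? 'Location: ' + location : '',
        clean(member.email)
    ];
    return lines.filter(function (line) {
        return line !== '';
    }).join('\n');
}
```

Inside the .each() callback, the returned string could then be added with something like $('<pre>').text(formatMember(member)).appendTo('#results'). Using .text() rather than string concatenation also keeps scraped values from being interpreted as HTML.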

  • This works perfect!!! Thanks so much, and the tutorial is awesome! Thanks again for the help! – hakarune Dec 28 '12 at 02:23
  • let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/21805/discussion-between-hakarune-and-arttronics) – hakarune Dec 28 '12 at 07:49
  • I finally finished! jsFiddle #1 is a rewrite with new HTML and all-new CSS. It includes links for the thumbnails and faithfully recreates the original webpage theme for the member boxes. The XPATH is also redone. Per our chat in comments, I found the correct **YQL Rest Query** to avoid getting the header. Although one solution in the comments was to skip `index[0]` and use a function, that comes with extra maintenance, since it only fits the specific XPATH you have. Changing the XPATH without addressing that means you throw away needed results. The new XPATH works now. – arttronics Dec 28 '12 at 12:53
  • Extra: Above jsFiddles revised. I now include `<div class="clear"></div>` for the last item in the `divObject` variable. That will fix the crazy shifting of items you see. – arttronics Dec 31 '12 at 22:06