Tried Python BeautifulSoup and Phantom JS: STILL can't scrape websites

Question

You may have seen my desperate frustrations over the past few weeks on here. I've been scraping some wait time data and am still unable to grab data from these two sites:

http://www.centura.org/erwait

http://hcavirginia.com/home/

At first I tried BS4 for Python. Sample code below for HCA Virginia:

from bs4 import BeautifulSoup
import requests

url = 'http://hcavirginia.com/home/'
r = requests.get(url)

soup = BeautifulSoup(r.text, 'html.parser')
wait_times = [span.text for span in soup.find_all('span', attrs={'class': 'ehc-er-digits'})]

# Append one wait time per line
with open('HCA_Virginia.csv', 'a') as fd:
    for w in wait_times:
        fd.write(w + '\n')

All this does is write blanks to the console or the CSV. So I tried it with PhantomJS, since someone told me the content may be loaded by JS. Yet, same result! It prints blanks to the console or the CSV. Sample code below.

var page = require('webpage').create(),
    url = 'http://hcavirginia.com/home/';

page.open(url, function(status) {
    if (status !== "success") {
        console.log("Can't access network");
    } else {
        var result = page.evaluate(function() {
            var list = document.querySelectorAll('span.ehc-er-digits'), time = [], i;
            for (i = 0; i < list.length; i++) {
                time.push(list[i].innerText);
            }
            return time;
        });
        console.log(result.join('\n'));
        var fs = require('fs');
        try {
            fs.write("HCA_Virginia.csv", '\n' + result.join('\n'), 'a');
        } catch (e) {
            console.log(e);
        }
    }

    phantom.exit();
});

Same issues with Centura Health :(

What am I doing wrong?

Try [ghost.py](http://jeanphix.me/Ghost.py/) - it should load all the JS for you. It may be a bit slow, but I'd recommend checking it out. (disclaimer: I haven't gone over your code yet) — Steinar Lima, Feb 25 '14 at 23:57
Seems like `ghost.py` is quite buggy - it worked for my use last time, but now my test script exits with an error. :/ — Steinar Lima, Feb 26 '14 at 00:37
I think it's some sort of load issue with the page maybe? I've used BS4 and Phantom successfully on many other sites. These sites throw it off... — JJThaeler, Feb 26 '14 at 01:37

Steinar Lima · Accepted Answer · 2015-10-01T13:24:36.903

The problem you're facing is that the elements are created by JS, and it might take some time to load them. You need a scraper which handles JS, and can wait until the required elements are created.
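
To make the "wait until the required elements are created" idea concrete, here is a minimal Python 3 sketch that uses only the standard library. It is an illustration under assumptions, not a replacement for a real JS-capable renderer: `extract_digits`, `wait_for_digits`, the `get_html` callable and the two HTML snapshots are hypothetical names and data invented for this example, and the markup is assumed rather than copied from the real page. It shows why a plain parse of the first response yields blanks, and how polling until the spans contain text avoids that.

```python
import time
from html.parser import HTMLParser


class DigitSpanParser(HTMLParser):
    """Collect the text of every <span class="ehc-er-digits">."""

    def __init__(self):
        super().__init__()
        self._in_target = False
        self.values = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "span" and "ehc-er-digits" in classes:
            self._in_target = True
            self.values.append("")

    def handle_data(self, data):
        if self._in_target:
            self.values[-1] += data

    def handle_endtag(self, tag):
        if tag == "span":
            self._in_target = False


def extract_digits(html):
    """Return the stripped text of each target span (may be empty strings)."""
    parser = DigitSpanParser()
    parser.feed(html)
    return [v.strip() for v in parser.values]


def wait_for_digits(get_html, timeout=10.0, interval=0.5):
    """Poll get_html() until every target span has text, or raise on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        values = extract_digits(get_html())
        if values and all(values):
            return values
        if time.monotonic() >= deadline:
            raise TimeoutError("wait-time spans were never filled in")
        time.sleep(interval)


# Hypothetical snapshots: the raw server response, then the DOM after JS ran.
snapshots = iter([
    '<span class="ehc-er-digits"></span><span class="ehc-er-digits"></span>',
    '<span class="ehc-er-digits">21</span><span class="ehc-er-digits">23</span>',
])
result = wait_for_digits(lambda: next(snapshots), timeout=5.0, interval=0.01)
print(result)
```

In a real scraper, `get_html` would return the current DOM from whatever renderer you use; the polling loop itself stays the same.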

You can use PyQt4. By adapting this recipe from webscraping.com and pairing it with an HTML parser like BeautifulSoup, this is pretty easy:

(After writing this, I found the webscraping library for Python. It might be worth a look.)

import sys
from bs4 import BeautifulSoup
from PyQt4.QtGui import *
from PyQt4.QtCore import *
from PyQt4.QtWebKit import * 

class Render(QWebPage):
    def __init__(self, url):
        self.app = QApplication(sys.argv)
        QWebPage.__init__(self)
        self.loadFinished.connect(self._loadFinished)
        self.mainFrame().load(QUrl(url))
        self.app.exec_()

    def _loadFinished(self, result):
        self.frame = self.mainFrame()
        self.app.quit()   

url = 'http://hcavirginia.com/home/'
r = Render(url)
soup = BeautifulSoup(unicode(r.frame.toHtml()))
# In Python 3.x, don't unicode the output from .toHtml():
#soup = BeautifulSoup(r.frame.toHtml())
# Convert the text inside each span (not the Tag itself) to an int
nums = [int(span.get_text()) for span in soup.find_all('span', class_='ehc-er-digits')]
print(nums)

Output:

[21, 23, 47, 11, 10, 8, 68, 56, 19, 15, 7]
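
Since the original goal was to log these values to a CSV, here is a small Python 3 sketch of one way to append each scrape as a timestamped row. The function name `append_wait_times`, the row layout (timestamp first, then the values) and the `when` parameter are assumptions made for illustration; they are not part of the question or of any library.

```python
import csv
from datetime import datetime


def append_wait_times(path, wait_times, when=None):
    """Append one CSV row: an ISO timestamp followed by the wait times."""
    when = when or datetime.now()
    row = [when.isoformat(timespec="seconds")] + list(wait_times)
    # newline="" lets the csv module control line endings itself
    with open(path, "a", newline="") as fh:
        csv.writer(fh).writerow(row)


nums = [21, 23, 47, 11, 10, 8, 68, 56, 19, 15, 7]  # values from the output above
append_wait_times("HCA_Virginia.csv", nums)
```

Writing one row per scrape, rather than one value per line, keeps each run's readings together and makes it possible to tell later when they were taken.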

This was my original answer, using ghost.py:

I managed to hack something together for you using ghost.py (tested on Python 2.7, ghost.py 0.1b3 and PyQt4-4 32-bit). I wouldn't recommend using this in production code, though!

from ghost import Ghost
from time import sleep

ghost = Ghost(wait_timeout=50, download_images=False)
page, extra_resources = ghost.open('http://hcavirginia.com/home/',
                                   headers={'User-Agent': 'Mozilla/4.0'})

# Halt execution of the script until a span.ehc-er-digits is found in 
# the document
page, resources = ghost.wait_for_selector("span.ehc-er-digits")

# It should be possible to simply evaluate
# "document.getElementsByClassName('ehc-er-digits');" and extract the data from
# the returned dictionary, but I didn't quite understand the
# data structure - hence this inline javascript.
nums, resources = ghost.evaluate(
    """
    elems = document.getElementsByClassName('ehc-er-digits');
    nums = []
    for (i = 0; i < elems.length; ++i) {
        nums[i] = elems[i].innerHTML;
    }
    nums;
    """)

wt_data = [int(x) for x in nums]
print wt_data
sleep(30) # Sleep a while to avoid the crashing of the script. Weird issue!

Some comments:

As you can see from the code comments, I didn't quite figure out the structure of the dict returned by `Ghost.evaluate("document.getElementsByClassName('ehc-er-digits');")`. It's probably possible to find the information needed using such a query, though.
I also had some problems with the script crashing at the end. Sleeping for 30 seconds fixed the issue.

I'd vote this up, but I need 15 reputation points to do it. Thanks! — JJThaeler, Feb 26 '14 at 13:54
Thanks! The only change I had to make to the `pyQt4` answer was to change `soup = BeautifulSoup(unicode(r.frame.toHtml()))` to `soup = BeautifulSoup(r.frame.toHtml())` — dstudeba, Oct 01 '15 at 11:59
Cool that it worked, @dstudeba! Do you by any chance use Python 3.X? I haven't worked with BS for a while, but unicode is standard in Python 3, so I guess there is no `unicode` function either :) — Steinar Lima, Oct 01 '15 at 13:16
Yes, I was using Python 3.4. I can't try it on an earlier version of 3, since another package I am using depends on 3.4. Thanks again! — dstudeba, Oct 01 '15 at 13:20
That would explain it, @dstudeba. I'll update my answer to help future readers that want to apply this solution in a Python 3 environment. — Steinar Lima, Oct 01 '15 at 13:22
