
Until just this week I was able to use a simple HTML DOM parser to scrape content from Google Scholar. (Yes, I'm aware they don't want people doing that, hence no API.)

Yet in the past day or two it has stopped displaying content. When attempting a simple file_get_html on the URL, there is an error of:

Server Error. We're sorry, but it appears that there has been an internal server error while processing your request. Our engineers have been notified and are working to resolve the issue. Please try again later.

I've seen other questions out there, but the solutions are mostly R-specific or use cURL. Does anyone have suggestions for tweaking my simple PHP function, especially to call it twice? Or am I out of luck now that Google is closing this door?

My code:

<?php
require_once('assets/functions/simple_html_dom.php');
$google_id  = get_post_meta($post->ID, 'ecpt_google_id', true);
$google_url = 'http://scholar.google.com/citations?user=' . $google_id . '&pagesize=10';
$older_pubs = 'http://scholar.google.com/citations?user=' . $google_id;
$google     = file_get_html($google_url);

foreach($google->find('tr.gsc_a_tr') as $article) {
    $item['title']  = $article->find('td.gsc_a_t a', 0)->plaintext;
    $item['link']   = $article->find('a.gsc_a_at', 0)->href;
    $item['pub']    = $article->find('td.gsc_a_t .gs_gray', 1)->plaintext;
    $item['year']   = $article->find('td.gsc_a_y', 0)->plaintext;

    ?>
    <p class="pub"><b><a href="http://scholar.google.com<?php echo $item['link'];?>"><?php echo $item['title']; ?></a></b></p>
    <h6 class="pub"><?php echo $item['year']; ?>, <?php echo $item['pub']; ?></h6>


    <?php } ?>
<p align="right"><b><a href="<?php echo $older_pubs; ?>">View Publications</a></b></p>

1 Answer


Google Scholar is no longer accessible without accepting cookies. A "server error" occurs if you try to access it with curl/wget/etc.

Try accepting the cookie; for curl/PHP see: Google Server gives a server error with the first request in private browsing mode

Then load the page twice: the first request accepts the cookie (and returns the server error), and the second request returns the content.
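As a minimal sketch of the "load twice" approach with cURL, assuming simple_html_dom is still used for parsing (the function name and cookie file path here are examples, not from the original code):

```php
<?php
require_once('assets/functions/simple_html_dom.php');

function fetch_scholar_html($url) {
    // Example path; any writable location works for the cookie jar.
    $cookie_file = sys_get_temp_dir() . '/scholar_cookies.txt';

    $curl = curl_init($url);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($curl, CURLOPT_COOKIEJAR, $cookie_file);  // save received cookies
    curl_setopt($curl, CURLOPT_COOKIEFILE, $cookie_file); // send them back

    curl_exec($curl);          // first request: sets the cookie, gets the error page
    $html = curl_exec($curl);  // second request: returns the real content
    curl_close($curl);

    return str_get_html($html); // parse the string with simple_html_dom
}
```

The returned object can then be used with the same `find('tr.gsc_a_tr')` loop as in the question, in place of `file_get_html($google_url)`.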

Markus
  • Thanks! I used the code snippet in the answer and it's working great! However, there's no way around forcing the user to refresh the page, is there? – timmyg Nov 19 '15 at 18:17
  • Yes, there is. Basically there are two ways. (1) You can load the page twice, like: curl_exec($curl); // sets the cookie, then $data = curl_exec($curl); // loads the real data. Or (2) you can use one cookie for all of your visitors: change $config['cookie_file'] = $dir . '/cookies/' . md5($_SERVER['REMOTE_ADDR']) . '.txt'; to $config['cookie_file'] = '/tmp/myscholarcookie.txt'; The first solution might slow down the loading speed of your page, so in this case I would prefer (2). – Markus Nov 20 '15 at 02:23
  • thanks so much! this fixed the refreshing requirement. – timmyg Nov 23 '15 at 16:55
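The shared-cookie variant Markus describes as option (2) amounts to a one-line change to the config fragment from the linked answer (the `$config` array and path are as given in that comment; this is a config sketch, not a complete script):

```php
// Option (2): one cookie file shared by all visitors, instead of a
// per-visitor file keyed on md5($_SERVER['REMOTE_ADDR']).
// This avoids the double request on every visitor's first page load,
// since the shared cookie only needs to be established once.
$config['cookie_file'] = '/tmp/myscholarcookie.txt';
```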