Hi, is there a way to download the BibTeX entry for something from Google Scholar using PHP without having to download the BibTeX manually one by one? For example, setting a search value like "research" and then downloading the related BibTeX from the links automatically through code.

Any help would be appreciated. I tried fetching the HTML page, but the "Import into BibTeX" link is missing from the retrieved page contents.

My code:

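<?php
// Google Scholar search URL with plain & separators between query parameters
$url = 'http://scholar.google.com/scholar?q=honors+college&hl=en&btnG=Search&as_sdt=1%2C4&as_sdtp=on';
$needle = 'Import into bibtex';
$contents = file_get_contents($url);
echo $contents;
// stripos() is case-insensitive, so the check does not depend on the link's capitalization
if ($contents !== false && stripos($contents, $needle) !== false) {
    echo 'found';
} else {
    echo 'not found';
}
?>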
bouteillebleu
jarus
  • A lot of Google's web-based interfaces are heavily JavaScript dependent, which your screen scraper can't handle. You'd have to figure out what's happening in the background to replicate it via scripting. – Marc B Nov 21 '11 at 20:03
  • I think the "Import into BibTeX" link is only displayed when you're logged in. Try to log in to Google (which I don't know how to do programmatically) and then fetch the Scholar page. – koppor May 05 '12 at 09:13

2 Answers

The short answer is: no, you cannot do this.

Google does not provide APIs for Search or Scholar and applies strict rate limiting. The problem is that each BibTeX entry needs two additional requests on top of the search query itself: one for the 'import link' and a final one to get the actual BibTeX entry content.
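To make the request count concrete, here is a minimal sketch. It assumes one request for the search page and then two more per entry (import link plus BibTeX content); `requestsNeeded` is a hypothetical name used only for illustration.

```php
<?php
// Hypothetical helper: total HTTP requests for one search plus its BibTeX entries.
// Assumption: 1 request for the search page, then 2 per entry
// (one for the import link, one for the BibTeX content).
function requestsNeeded(int $entries): int {
    return 1 + 2 * $entries;
}

echo requestsNeeded(10), PHP_EOL; // prints 21
```

Under that assumption, even a single page of 10 results costs 21 requests, which is why the rate limit is hit so quickly.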

I wrote a script that scrapes Google Scholar results, finds the BibTeX links and saves the results. However, because of the rate limit it is not viable and will get blocked almost instantly.

The code can be viewed here: https://gist.github.com/Tessmore/11099509 and is free to use, but at your own risk.

Tessmore

As Tessmore said, you can't do this by scraping Google directly. But you can make it work by using the Google Scholar Organic Results API from SerpApi, which bypasses quota limits and blocks from search engines, so you don't have to think about how to reduce the chance of being blocked.

Example:

Install the google-search-results-php package first via Composer:

$ composer require serpapi/google-search-results-php:2.0

Code to integrate and full example in the online IDE:

<?php
ini_set("display_errors", 1);
ini_set("display_startup_errors", 1);
error_reporting(E_ALL);

require __DIR__ . "/vendor/autoload.php";

function getResultIds () {
    $result_ids = array();

    $params = [
        "engine" => "google_scholar", // parsing engine
        "q" => "biology"              // search query
    ];
    
    $search = new GoogleSearch(getenv("API_KEY"));
    $response = $search->get_json($params);
    
    foreach ($response->organic_results as $result) {
        // print_r($result->result_id);
        
        array_push($result_ids, $result->result_id);
    }

    return $result_ids;
}

function getBibtexData () {
    $bibtex_data = array();

    foreach (getResultIds() as $result_id) {
        $params = [
            "engine" => "google_scholar_cite",  // parsing engine
            "q" => $result_id
        ];
    
        $search = new GoogleSearch(getenv("API_KEY"));
        $response = $search->get_json($params);

        foreach ($response->links as $result) {
            if ($result->name === "BibTeX") {
                array_push($bibtex_data, $result->link);
            }
        }
    }
    
    return $bibtex_data;
}

print_r(json_encode(getBibtexData(), JSON_PRETTY_PRINT | JSON_UNESCAPED_SLASHES));
?>

Output:

[
    "https://scholar.googleusercontent.com/scholar.bib?q=info:KNJ0p4CbwgoJ:scholar.google.com/&output=citation&scisdr=CgXjqB_WGAA:AAGBfm0AAAAAYkm8amenawYn_EBidiCQT5QBh0L1KJEX&scisig=AAGBfm0AAAAAYkm8at9X4P3eIWKUCOc6UriCEDKVsQE0&scisf=4&ct=citation&cd=-1&hl=en",
    "https://scholar.googleusercontent.com/scholar.bib?q=info:6zRLFbcxtREJ:scholar.google.com/&output=citation&scisdr=CgWhqfi6GAA:AAGBfm0AAAAAYkm8bDoIhTlfTkQFCOzYGax54Bst576o&scisig=AAGBfm0AAAAAYkm8bMe_7Nq4e4pB5lg_eR9jmeGrO8ek&scisf=4&ct=citation&cd=-1&hl=en",
    "https://scholar.googleusercontent.com/scholar.bib?q=info:6Yb0qOX88FMJ:scholar.google.com/&output=citation&scisdr=CgXn_4MdGAA:AAGBfm0AAAAAYkm8bi8ypCZcFDNEQZYZeoSlvx-U1OSk&scisig=AAGBfm0AAAAAYkm8bnFMnwTWGfkfJDCNEx0C4n-aQwql&scisf=4&ct=citation&cd=-1&hl=en",
    "https://scholar.googleusercontent.com/scholar.bib?q=info:HFdEElNr3IgJ:scholar.google.com/&output=citation&scisdr=CgXKCFpQGAA:AAGBfm0AAAAAYkm8byukcQCl4WHQx-nSNp2pC1gUFSKG&scisig=AAGBfm0AAAAAYkm8b8EReTVkLwtxfth_pjwMyyY3dqts&scisf=4&ct=citation&cd=-1&hl=en",
    "https://scholar.googleusercontent.com/scholar.bib?q=info:bs-D_MeC14YJ:scholar.google.com/&output=citation&scisdr=CgXEUXwWGAA:AAGBfm0AAAAAYkm8bwwfMNJrffe16EaGypsem9JlmGTi&scisig=AAGBfm0AAAAAYkm8b6nWlPOQL63fXg6dV2U-JQbpyQyS&scisf=4&ct=citation&cd=-1&hl=en",
    "https://scholar.googleusercontent.com/scholar.bib?q=info:Rn1qFVLRfKwJ:scholar.google.com/&output=citation&scisdr=CgU-HswkGAA:AAGBfm0AAAAAYkm8cHE1YRK23eHV8nzF89Eem-Bsuz72&scisig=AAGBfm0AAAAAYkm8cDEj8ZrzZjAo2bNX-tjYYYJYQZay&scisf=4&ct=citation&cd=-1&hl=en",
    "https://scholar.googleusercontent.com/scholar.bib?q=info:d8thHtTwq6YJ:scholar.google.com/&output=citation&scisdr=CgXj7oe9GAA:AAGBfm0AAAAAYkm8cTYamCKGKImjdg5MQdgbxUIIHAEY&scisig=AAGBfm0AAAAAYkm8cTcop1ceKzKYvKAKtvlSQ1EdEtSN&scisf=4&ct=citation&cd=-1&hl=en",
    "https://scholar.googleusercontent.com/scholar.bib?q=info:IUmhOhGaDaEJ:scholar.google.com/&output=citation&scisdr=CgU0qZ2_GAA:AAGBfm0AAAAAYkm8ctCPwoihZkjbNcdEqSnwa0J3jwDy&scisig=AAGBfm0AAAAAYkm8cingBcYnEp8YRqFDFdN-FAEBgDT7&scisf=4&ct=citation&cd=-1&hl=en",
    "https://scholar.googleusercontent.com/scholar.bib?q=info:PWsf8O5OMQEJ:scholar.google.com/&output=citation&scisdr=CgVBAJxXGAA:AAGBfm0AAAAAYkm8c3CDKQG0Wh_lWsXU_DZxEJkwZz5y&scisig=AAGBfm0AAAAAYkm8c6I-HjAxD1Gy6FLFDRdxH_qU4OBr&scisf=4&ct=citation&cd=-1&hl=en",
    "https://scholar.googleusercontent.com/scholar.bib?q=info:yGvgHH8ROuIJ:scholar.google.com/&output=citation&scisdr=CgXFuhOkGAA:AAGBfm0AAAAAYkm8dD0rcSR4LQF8GgTxx865BADtXNDN&scisig=AAGBfm0AAAAAYkm8dIQhodz3rHF9IUdaCSRlhdudACNQ&scisf=4&ct=citation&cd=-1&hl=en"
]

BibTeX data from the first URL:

@article{woese2004new,
  title={A new biology for a new century},
  author={Woese, Carl R},
  journal={Microbiology and molecular biology reviews},
  volume={68},
  number={2},
  pages={173--186},
  year={2004},
  publisher={Am Soc Microbiol}
}
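If the downloaded entry needs to be processed further in PHP, a minimal sketch of splitting it into fields could look like this. `parseBibtexEntry` is a hypothetical helper, not part of SerpApi or any BibTeX library, and it assumes one `name={value}` pair per line, as in the entry above.

```php
<?php
// Hypothetical helper: split one simple BibTeX entry into type, citation key and fields.
// Assumption: one `name={value}` pair per line; this is not a full BibTeX parser.
function parseBibtexEntry(string $bibtex): ?array {
    // Entry header, e.g. "@article{woese2004new,"
    if (!preg_match('/@(\w+)\s*\{\s*([^,\s]+)\s*,/', $bibtex, $head)) {
        return null;
    }
    // Field lines, e.g. "  year={2004},"
    preg_match_all('/^[ \t]*(\w+)[ \t]*=[ \t]*\{(.*)\}[ \t]*,?[ \t]*$/m', $bibtex, $pairs, PREG_SET_ORDER);
    $fields = [];
    foreach ($pairs as $pair) {
        $fields[strtolower($pair[1])] = $pair[2];
    }
    return ['type' => strtolower($head[1]), 'key' => $head[2], 'fields' => $fields];
}

$entry = parseBibtexEntry(<<<'BIB'
@article{woese2004new,
  title={A new biology for a new century},
  author={Woese, Carl R},
  journal={Microbiology and molecular biology reviews},
  volume={68},
  number={2},
  pages={173--186},
  year={2004},
  publisher={Am Soc Microbiol}
}
BIB);

echo $entry['key'], ' (', $entry['fields']['year'], '): ', $entry['fields']['title'], PHP_EOL;
```

Entries with nested braces spread over several lines would need a real BibTeX parser instead of this regex.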

Disclaimer: I work for SerpApi.

Dmitriy Zub
  • When trying to download the BibTeX by following the links I still run into 403 – christianbrodbeck Mar 31 '22 at 14:29
  • @christianbrodbeck could you check one more time on replit? The reason could be a change to my API key, which is stored in the replit `env` file. Also added a GIF code execution example. – Dmitriy Zub Apr 01 '22 at 09:31
  • The issue is not with the code execution, it's when actually trying to retrieve the BibTeX entries using the URLs it generates. I can even paste one of your URLs (https:\/\/scholar.googleusercontent.com\/scholar.bib?q=info:YnWp49O_RTMJ:scholar.google.com\/&output=citation&scisdr=CgXCiln7GAA:AAGBfm0AAAAAYjHB1PjuGwPWg-Oc1PTDkki_-3T_pD2o&scisig=AAGBfm0AAAAAYjHB1OoX_TdI3yhMKMvdA1dCMdNG0sfZ&scisf=4&ct=citation&cd=-1&hl=en) in my browser now and get a 403. This was the first one I tried today. I wonder whether Google blocks requests to BibTeX without requesting the citation list first? – christianbrodbeck Apr 01 '22 at 20:59
  • @christianbrodbeck my bad, I forgot to add [`JSON_UNESCAPED_SLASHES`](https://www.php.net/manual/en/json.constants.php) so that `json_encode()` does not escape `/`. You can try to run it one more time, or have a look at the attached GIF above. Thank you for your clarification. – Dmitriy Zub Apr 02 '22 at 13:34
  • That does not change the result. If I just copy-paste a URL you extract (https://scholar.googleusercontent.com/scholar.bib?q=info:Rn1qFVLRfKwJ:scholar.google.com/&output=citation&scisdr=CgXIrtxGGAA:AAGBfm0AAAAAYkhRZbicNgtIsr2tuOzyv76m1eXIEtnc&scisig=AAGBfm0AAAAAYkhRZT-nlCvpRbO8533tqJEdsoPuKg-t&scisf=4&ct=citation&cd=-1&hl=en) into the browser, I get a 403. – christianbrodbeck Apr 03 '22 at 14:08
  • @christianbrodbeck I don't get it. I just tried it one more time: run on replit, opened each URL from the terminal output. Every link was `200` with Bibtex data. One guess is that those links expire after some time or something else. I updated the GIF to show actual clicking on the first URL from the terminal output. – Dmitriy Zub Apr 03 '22 at 15:24
  • That must be it! Have to use the link right away. Thanks!! – christianbrodbeck Apr 03 '22 at 19:23
  • @christianbrodbeck Of course, hope it helps ;) – Dmitriy Zub Apr 04 '22 at 04:13
  • There does seem to be some sort of counter too, I can only download a few citations a day before getting a 403... – christianbrodbeck Apr 04 '22 at 22:14
  • @christianbrodbeck if you're using SerpApi, it shouldn't have any sort of limits. It uses dedicated proxies and a captcha solver. Feel free to [open an issue](https://github.com/serpapi/public-roadmap/issues) (if using SerpApi) with the detailed problem. Here's [how to report an issue](https://github.com/serpapi/public-roadmap#report-an-issue). You'll get a faster solution rather than in the comments here. – Dmitriy Zub Apr 05 '22 at 06:01
  • I am using SerpApi to get the BibTeX link, but to retrieve the BibTeX itself I then use that link directly. Is there a way to retrieve the BibTeX (i.e., get the content from the links that you print in your script) through SerpApi? – christianbrodbeck Apr 05 '22 at 10:40
  • @christianbrodbeck A late reply. Currently, it's not available. There's an open issue at [SerpApi public-roadmap](https://github.com/serpapi/public-roadmap/issues/166) and I wrote [a workaround for it by making another request using `requests`](https://github.com/serpapi/public-roadmap/issues/166#issuecomment-1144570005). – Dmitriy Zub Aug 15 '22 at 05:40
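
Since the comments suggest the BibTeX links stop working after some time, one option is to download each link as soon as it is produced and check the HTTP status. The following is only a sketch: `parseHttpStatus` and `fetchBibtex` are hypothetical helper names, and nothing here prevents Google from answering with a 403.

```php
<?php
// Hypothetical helper: extract the numeric status from the header lines PHP
// exposes after an HTTP request (e.g. "HTTP/1.1 403 Forbidden").
function parseHttpStatus(array $headers): ?int {
    $status = null;
    foreach ($headers as $line) {
        // After redirects there can be several status lines; keep the last one.
        if (preg_match('#^HTTP/\S+\s+(\d{3})#', $line, $m)) {
            $status = (int) $m[1];
        }
    }
    return $status;
}

// Hypothetical helper: download one BibTeX link right after it was produced.
// Returns the BibTeX text, or null on any non-200 answer (such as a 403).
function fetchBibtex(string $url): ?string {
    $context = stream_context_create([
        // ignore_errors lets us read the status line even for 4xx/5xx responses
        'http' => ['ignore_errors' => true, 'timeout' => 10],
    ]);
    $body = @file_get_contents($url, false, $context);
    // $http_response_header is filled in by PHP's HTTP stream wrapper.
    $headers = $http_response_header ?? [];
    if ($body === false || parseHttpStatus($headers) !== 200) {
        return null;
    }
    return $body;
}
```

A caller could loop over the links returned by `getBibtexData()` and call `fetchBibtex($link)` on each one immediately, treating a `null` result as "blocked or expired" rather than retrying in a tight loop.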