I want to scrape a website using Google Apps Script. Unfortunately, it is giving me a 406 error.

Below are the full details of the error:

resp:
"<html><head><title>Error 406 - Not Acceptable</title><head><body><h1>Error 406 - Not Acceptable</h1><p>Generally a 406 error is caused because a request has been blocked by Mod Security. If you belie…"

Below is a sample of the code:

var ss = SpreadsheetApp.getActiveSpreadsheet();

var options = {
  'method': 'get',
  'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36',
  'muteHttpExceptions': true,
  Authorization: 'Bearer ?????',
  Accept: 'application/json',
  'Content-Type': 'application/json',
};

function scrapeJobs() {
  var result = [];
  for (var i = 1; i <= 2; i++) {
    var url = 'https://www.xxxxxxx/page/' + i;
    var resp = UrlFetchApp.fetch(url, options).getContentText();
    var $ = Cheerio.load(resp);
    var jobList = $("#titlo > strong > a");
    var urls = jobList.map(function() { return $(this).attr('href'); }).toArray();

    // debug code - outputs the urls it collected
    console.log(urls);

    for (let i = 0; i < urls.length; i++) {
      var data = scrapeJobDetails(urls[i]);
      if (data != null) {
        result.push(...data);
      }
    }
  }
}

I tried following the answer to a similar question, but without success. See the link below:

Google App Script external API return error 406

Rubén
lily
  • Your request only accepts `application/json`, but the website you want to scrape probably serves `text/html`. Consider which headers you really need to set in your request. – Heiko Theißen Dec 28 '22 at 14:23
  • Hello @HeikoTheißen, thanks for your response. How should my headers look if the website serves text/html? Please edit my code. – lily Dec 28 '22 at 15:06

2 Answers


To fetch a website for scraping, simply use

function myFunction() {
  Logger.log(UrlFetchApp.fetch('<url of website>', {
    muteHttpExceptions: true
  }).getContentText());
}

(When running myFunction in the Apps Script editor, this logs the contents of the website.)

The muteHttpExceptions option is necessary if you also want to scrape error pages such as "404 Not Found". (Without it, such a page raises a catchable exception.)
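
If the site does turn out to need request headers, the following is a minimal sketch rather than a definitive fix. It assumes the target serves HTML, so it asks for `text/html` instead of `application/json`, and it places the header under the `headers` key of the options object, which is where UrlFetchApp reads request headers from. The function names `buildHtmlFetchOptions` and `fetchHtmlOrNull` and the optional `fetchApp` parameter are hypothetical and only there for illustration.

```javascript
// Builds UrlFetchApp options; request headers belong under the `headers` key.
function buildHtmlFetchOptions() {
  return {
    method: 'get',
    muteHttpExceptions: true, // return 4xx/5xx responses instead of throwing
    headers: {
      // Assumption: the target site serves HTML pages, not JSON.
      Accept: 'text/html,application/xhtml+xml;q=0.9,*/*;q=0.8'
    }
  };
}

// Hypothetical helper: returns the page body on HTTP 200, otherwise null.
// `fetchApp` falls back to UrlFetchApp when omitted (the normal Apps Script case).
function fetchHtmlOrNull(url, fetchApp) {
  var app = fetchApp || UrlFetchApp;
  var response = app.fetch(url, buildHtmlFetchOptions());
  var status = response.getResponseCode();
  if (status !== 200) {
    console.log('Request to ' + url + ' returned HTTP ' + status);
    return null;
  }
  return response.getContentText();
}
```

In the Apps Script editor this would be called as `fetchHtmlOrNull('<url of website>')`. Because muteHttpExceptions is set, a 406 comes back as an ordinary response, so checking `getResponseCode()` is what keeps an error page from being parsed as if it were the real content.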

Heiko Theißen

The reference to ModSecurity is very odd. ModSecurity does not return status code 406 unless it is specially configured to do so. So the whole "Generally a 406 error is caused because a request has been blocked by Mod Security ..." text is probably a generic message that is misleading you here.
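
To see what the server actually sent rather than relying on the wording of the error page, one option is to log the status code and response headers alongside the start of the body. This is only a sketch: `summarizeResponse` and `logBlockedResponse` are hypothetical names, and whether the headers reveal anything useful about what produced the 406 depends on the server.

```javascript
// Hypothetical helper: summarises a UrlFetchApp HTTPResponse for debugging.
function summarizeResponse(response) {
  var headers = response.getHeaders();
  var lines = ['Status: ' + response.getResponseCode()];
  Object.keys(headers).forEach(function (name) {
    lines.push(name + ': ' + headers[name]);
  });
  // Keep only the start of the body; error pages can be long.
  lines.push('Body: ' + response.getContentText().slice(0, 200));
  return lines.join('\n');
}

function logBlockedResponse() {
  // Assumption: replace the placeholder with the real page URL.
  var response = UrlFetchApp.fetch('<url of website>', { muteHttpExceptions: true });
  console.log(summarizeResponse(response));
}
```

Running `logBlockedResponse` from the Apps Script editor prints the summary to the execution log.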

dune73