-2

I want to pull public data from a government website, but there is no API to expose that information directly. Attempts to use IMPORTXML with XPath is running into a warning banner that appears if you don't have an active session cookie, and returning no data as a result.

Page requested: https://www.dibbs.bsm.dla.mil/rfq/rfqrecs.aspx?category=nsn&value=7110-00-001-2667

Page received: https://www.dibbs.bsm.dla.mil/dodwarning.aspx?goto=/rfq/rfqrecs.aspx?category=nsn&value=7110-00-001-2667

This is all publicly accessible information, no login credentials required. Is there a way I can "click" the OK button on the warning page so that IMPORTXML accesses the correct page instead?

Rubén
  • 34,714
  • 9
  • 70
  • 166
Erich
  • 2,408
  • 18
  • 40
  • 1
    Scraping data from web pages is an exercise in reverge-engineering HTTP requests. When you navigate to the data manually, capture the HTTP requests and responses involved. Replicate those requests in your code. The "click" you refer to surely performs some action. Replicate that action. – David Oct 18 '17 at 13:56
  • my original question was in hopes that there was a built-in method within google sheets for acquiring and storing a session cookie for importxml to work. guess it's back to the drawing board and learning how to use gas. – Erich Oct 20 '17 at 19:16

1 Answers1

1

Taking cues from David's comment above, I used this question as a basis for displaying cookies using Google Apps Script:

var _URL = "https://www.dibbs.bsm.dla.mil/Rfq/RFQRecs.aspx?TypeSrch=cq&category=nsn&value=7110-00-001-2667";

function getData(_URL) {
  var opt = {
    "method" : "post",
    "User-Agent" : "Mozilla/5.0",
    "Accept" : "text/html,application/xhtml+xml,application/xml",
    "Accept-Language" : "en-US,en;q=0.5",    
    "followRedirects" : true
  };
  var response = UrlFetchApp.fetch(url,opt);
  var headers = response.getAllHeaders();
  var sessioncookie = headers['Set-Cookie']; 
  Logger.log(sessioncookie); 

  opt = {
    "method" : "get",
    "User-Agent" : "Mozilla/5.0",
    "Accept" : "text/html,application/xhtml+xml,application/xml",
    "Accept-Language" : "en-US,en;q=0.5",    
    "headers" : {
      "Cookie" : sessioncookie
    },
    "followRedirects" : true    
  };
  var content = UrlFetchApp.fetch(url, opt).getContentText();
  Logger.log("File size: " + content.length);

...

}

This returned a cookie called "ASP.NET_SessionId", which looked like this:

ASP.NET_SessionId=y0p5fp1cjl040p1ncr20h2gc; path=/; secure; HttpOnly

I passed this cookie back in the following HTTP request, hoping to get further. But I still wasn't able to bypass the warning page. In the process of troubleshooting, I got used to going into my Chrome settings and clearing cookies for this site, but then noticed that this particular site had set not one, but three different cookies, including one called "DIBBSDoDWarning" with its content simply being the string "AGREE". Hmm, could that do something?

Experimenting a bit, I found that I could simply send just this one cookie from the outset in a single request to get the page I wanted.

var opt = {
  "method" : "get",
  "User-Agent" : "Mozilla/5.0",
  "Accept" : "text/html,application/xhtml+xml,application/xml",
  "Accept-Language" : "en-US,en;q=0.5",    
  "headers" : {
    "Cookie" : "DIBBSDoDWarning=AGREE; path=/; secure; HttpOnly"
  },
  "followRedirects" : true
};
var content = UrlFetchApp.fetch(url, opt).getContentText();

There is no IMPORTXML support in Google Apps Script to easily scrape a webpage using Xpath, so what still remains to be done is figure out how do this more elegantly than I'm doing now. I tried using XmlService.parse() to return a Document, but the script consistently fails when it reaches this point (not sure if this page is malformed), so my fallback was a simple string search, attempting to simply get the number of results returned:

var pos = content.search('id="ctl00_cph1_lblRecCount"')
var recordCount = content.substr(pos+40,22).match(/\d+/).join();

Will update if I figure out a good general Xpath-oriented solution.

Erich
  • 2,408
  • 18
  • 40