
Summary: how to write a regexp for Google Apps Script that will fetch the anchors of all links from HTML

My task: I have a Google Spreadsheet with the URLs of pages that link to my website (webmaster -> links to me -> export). I need an anchor crawler (written in Google Apps Script) to see which links are spam.

Realisation (what I can do):

function doGetLinks(url, link, encoding)
{
  var encoding = "windows-1251";
  Utilities.sleep(1000);

  var page = UrlFetchApp.fetch(url).getContentText(encoding);
  var matched = page.match(/<a\s+(?:[^>]*?\s+)?href\s*=\s*(\"([^"]*\")|'[^']*'|([^'">\s]+)).*<\/a>/gim);

  var amt = "$0";
  if (matched != null)
  {
    for (var i in matched)
    {
      var anchor = matched[i];
      amt = anchor + " | ";
    }
  }

  return amt;
}

how to see it:

Problems (what I can't do):

  1. How to write a regexp that returns only the anchors
  2. How to force it to return all matching links (currently only the first is returned, although the /g flag is used)
  3. How to embed the variable 'link' in the regexp: a regexp literal has no quotes to concatenate with, but I need to see only the links to my website
GlobeCore
  • You can see a sample at https://docs.google.com/spreadsheet/ccc?key=0Ap5D58-gT2y7dC1IN1JtTUpzcG5PeElvQnM3SzFWUHc&usp=docslist_api#gid=0 – GlobeCore Feb 14 '14 at 20:17

1 Answer


While you might be able to hardcode some scenarios, you won't cover the general case. If you don't believe me, ask this guy: RegEx match open tags except XHTML self-contained tags
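
For the narrow scenario in the question, a minimal sketch could look like the one below. It is only an illustration under stated assumptions, not a general HTML parser: it assumes each link is a plain `<a ... href=...>text</a>` element without nested `<a>` tags, and it will miss or mangle unusual markup. The names `extractAnchors` and `escapeRegExp` are hypothetical helpers, not part of Apps Script. The sketch touches the three problems from the question: a capture group isolates the anchor text, an `exec` loop that pushes into an array collects every match instead of overwriting the previous one, and the `RegExp` constructor builds a pattern from the `link` variable.

```javascript
// Hypothetical helper: escape characters that are special inside a regexp,
// so that a plain string such as "example.com" is matched literally.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical helper: return the anchor texts of all <a> elements whose
// href contains `link`. Works only on simple, non-nested <a> markup.
function extractAnchors(html, link) {
  // Groups 1-3: href value (double-quoted, single-quoted, unquoted).
  // Group 4: everything between <a ...> and the nearest </a> (lazy match).
  var tagPattern = /<a\s[^>]*?href\s*=\s*(?:"([^"]*)"|'([^']*)'|([^'">\s]+))[^>]*>([\s\S]*?)<\/a>/gi;

  // A pattern built from a variable needs the RegExp constructor.
  var linkPattern = new RegExp(escapeRegExp(link), "i");

  var anchors = [];
  var match;
  // With the g flag, each exec() call continues after the previous match.
  while ((match = tagPattern.exec(html)) !== null) {
    var href = match[1] || match[2] || match[3] || "";
    if (!linkPattern.test(href)) {
      continue; // link points somewhere else
    }
    // Strip nested tags such as <b> and collapse whitespace.
    var text = match[4].replace(/<[^>]*>/g, "").replace(/\s+/g, " ").trim();
    anchors.push(text);
  }
  return anchors;
}
```

Inside the question's doGetLinks, the result could then be joined for a spreadsheet cell, for example extractAnchors(page, link).join(" | "), where page is the text already fetched with UrlFetchApp.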

Zig Mandel
  • Good recommendation. But it also suggests that the framework should have a powerful HTML parsing tool. I tried: var html = UrlFetchApp.fetch(url).getContentText(); var doc = XmlService.parse(html); var html = doc.getRootElement(); It returns an error for the invalid XHTML that is used on real pages. I also tried: var doc = Xml.parse(page, true); var body = doc.html.body; var a = body.getElements("a"); a = a.getText(); return a; It returns no errors, but it still does not work for me. That is why I tried a regexp, which at least works. – GlobeCore Feb 17 '14 at 08:57
  • HTML is not valid XML in general. – Zig Mandel Feb 17 '14 at 13:38