Summary: how to write a regexp for Google Apps Script that will fetch the anchor text of all links from HTML
My task: I have a Google Spreadsheet with the URLs of pages that link to my website (webmaster -> links to me -> export). I need an anchor crawler (using Google Apps Script) to see which links are spam.
Implementation (what I have so far):
function doGetLinks(url, link, encoding)
{
  var encoding = "windows-1251";
  Utilities.sleep(1000);
  var page = UrlFetchApp.fetch(url).getContentText(encoding);
  var matched = page.match(/<a\s+(?:[^>]*?\s+)?href\s*=\s*(\"([^"]*\")|'[^']*'|([^'">\s]+)).*<\/a>/gim);
  var amt = "$0";
  if (matched != null)
  {
    for (var i in matched)
    {
      var anchor = matched[i];
      amt = anchor + " | ";
    }
  }
  return amt;
}
How to reproduce it:
- enter this formula in any cell: =doGetLinks("http://4uarticles.net/15295/insulating-oil-reconditioning/", "articlesynergy.com")
Problems (what I can't do):
- how to write the regexp so that it returns only the anchor text
- how to make it return all matching links (currently only the first is returned, although the /g flag is used)
- how to build the variable 'link' into the regexp: a regexp literal has no quotes to concatenate it into, but I need to see only the links to my website
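For reference, here is a minimal sketch of one way the three points above could be approached. It is not a definitive solution. The names escapeRegExp and extractAnchors are hypothetical helpers introduced only for this illustration, and the sketch works on an HTML string so that it can be tried outside a spreadsheet. It assumes that capture groups read with exec() in a loop are acceptable for getting the anchor text, and that new RegExp() with an escaped string is acceptable for building 'link' into a pattern. Regular expressions are not a full HTML parser, so unusual markup may be missed.

```javascript
// Hypothetical helper: escape regexp metacharacters so that a plain string
// such as "articlesynergy.com" can be embedded in a pattern literally.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Hypothetical helper: return the anchor text of every <a> element in `html`
// whose href contains `link`.
function extractAnchors(html, link) {
  // Groups 1-3 capture the href value (double-quoted, single-quoted, unquoted).
  // Group 4 captures everything between <a ...> and the nearest </a>.
  var anchorPattern =
    /<a\s[^>]*?href\s*=\s*(?:"([^"]*)"|'([^']*)'|([^'">\s]+))[^>]*>([\s\S]*?)<\/a>/gi;

  // A pattern built from a variable needs the RegExp constructor, not a literal.
  var linkPattern = new RegExp(escapeRegExp(link), 'i');

  var anchors = [];
  var m;
  // With the g flag, each exec() call continues after the previous match
  // and still exposes the capture groups.
  while ((m = anchorPattern.exec(html)) !== null) {
    var href = m[1] || m[2] || m[3] || '';
    if (linkPattern.test(href)) {
      // Drop nested tags such as <b> and collapse whitespace in the text.
      var text = m[4].replace(/<[^>]+>/g, '').replace(/\s+/g, ' ').trim();
      anchors.push(text);
    }
  }
  return anchors;
}
```

Inside doGetLinks, the result could then be returned as extractAnchors(page, link).join(" | "). Note that in the original loop the statement amt = anchor + " | " overwrites amt on every iteration instead of appending to it, so only one match survives even though match() with the /g flag finds all of them.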