I am using Google Apps Script. I have fetched the HTML of a web page and saved it as a string, and I am trying to extract part of that content using RegEx. I want to fetch the data in the format below:

<font color="#FF0101">
        Data which is want to fetch
</font>

Which RegEx should I use to get the data contained within the <font> tags (between the opening and closing tags)? Note the color attribute: I only want data from tags that have that color attribute with the value given in the code above.

Mogsdad
SKG
    Have a look at [RegEx match open tags except XHTML self-contained tags](http://stackoverflow.com/q/1732348/218196) – Felix Kling Jan 04 '16 at 17:10

3 Answers

2

Instead of wrestling with using RegEx to parse HTML, you can use Google Apps Script's XmlService to interpret well-formed HTML text.

function myFunction() {
  var xml = '<font color="#FF0101">Data which is want to fetch</font>';
  var doc = XmlService.parse(xml);
  var content = doc.getContent(0).getValue();
  Logger.log( content );  // "Data which is want to fetch"
  var color = doc.getContent(0).asElement().getAttribute('color').getValue();
  Logger.log( color );    // "#FF0101"
}
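
The snippet above reads a single element. As a minimal sketch of how the same approach might be extended to several <font> elements while filtering on the color attribute, the hypothetical helper below (collectFontText is not part of XmlService) walks an element tree using only getName(), getAttribute(), getText() and getChildren(). It assumes the markup is well-formed XML; real-world HTML often is not, and XmlService.parse will then throw.

```javascript
// Hypothetical helper: walks an XmlService-style element tree and collects the
// trimmed text of every <font> element whose color attribute equals `color`.
function collectFontText(element, color, results) {
  results = results || [];
  // getAttribute returns null when the attribute is absent
  var attr = element.getAttribute('color');
  if (element.getName().toLowerCase() === 'font' &&
      attr !== null && attr.getValue() === color) {
    results.push(element.getText().replace(/^\s+|\s+$/g, ''));
  }
  // Recurse into child elements so nested <font> tags are found too
  var children = element.getChildren();
  for (var i = 0; i < children.length; i++) {
    collectFontText(children[i], color, results);
  }
  return results;
}

// In Apps Script, usage might look like this (the <root> wrapper is an
// assumption, added so that the fragment has a single root element):
//   var root = XmlService.parse('<root>' + html + '</root>').getRootElement();
//   Logger.log(collectFontText(root, '#FF0101'));
```

The comparison on the attribute value is exact, so "#ff0101" in lower case would not match unless both sides are normalized first.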
Mogsdad
0

You are using JavaScript, so you have NO excuse for trying to parse HTML with regex.

var div = document.createElement('div');
div.innerHTML = "your HTML here";

var match = div.querySelectorAll("font[color='#FF0101']");
// loop through `match` and get stuff
// e.g. match[0].textContent.replace(/^\s+|\s+$/g,'')
Niet the Dark Absol
  • Niet the Dark Absol, thanks for pointing that out. I wrote that by mistake; I am actually using Google Apps Script. – SKG Jan 04 '16 at 17:17
  • Can you suggest something now? – SKG Jan 04 '16 at 17:19
0

If browser JavaScript were fully supported, you could use a DOM-based solution:

var html = "<font color=\"#FF0202\">NOT THIS ONE</font><font color=\"#FF0101\">\n        Data which is want to fetch\n</font>";
var faketag = document.createElement('faketag');
faketag.innerHTML = html;
var arr = [];
[].forEach.call(faketag.getElementsByTagName("font"), function(v,i,a) {
    if (v.hasAttributes() == true) {
      for (var o = 0; o < v.attributes.length; o++) {
        var attrib = v.attributes[o];
        if (attrib.name === "color" && attrib.value === "#FF0101")         {
         arr.push(v.innerText.replace(/^\s+|\s+$/g, ""));
        }
      }
    }
});
document.body.innerHTML = JSON.stringify(arr);

However, according to the Google Apps Script reference:

However, because Apps Script code runs on Google's servers (not client-side, except for HTML-service pages), browser-based features like DOM manipulation or the Window API are not available.

You may try obtaining the inner text of <font color="#FF0101"> tags with a regex:

function myFunction() {
  var doc = DocumentApp.getActiveDocument();
  var paras = doc.getParagraphs();
  // Capture the inner text of <font> tags whose color attribute is "#FF0101"
  var MyRegex = new RegExp('<font\\b[^<]*\\s+color="#FF0101"[^<]*>([\\s\\S]*?)</font>','ig');
  var match;
  for (var i = 0; i < paras.length; ++i) {
    while ((match = MyRegex.exec(paras[i].getText())) !== null)
    {
      Logger.log(match[1]);
    }
  }
}

Result against <font color="#FF0202">NOT THIS ONE</font><font color="#FF0101"> Data which is want to fetch</font>:

[image: screenshot of the result]

The regex matches any font tag that has a color attribute with the value #FF0101 inside double quotation marks. Mind that regexps are not reliable when parsing HTML! A better regex for this task, which replaces the lazy [\s\S]*? with an unrolled pattern, is

<font\\b[^<]*\\s+color="#FF0101"[^<]*>([^<]*(?:<(?!/font>)[^<]*)*)</font>
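
To check that the two patterns agree, here is a small sketch in plain JavaScript (no Apps Script services needed) that runs both against the sample input. The helper name collectMatches is made up for this illustration; the patterns themselves are the ones given above.

```javascript
// Sample input: one non-matching and one matching <font> tag
var html = '<font color="#FF0202">NOT THIS ONE</font>\n' +
           '<font color="#FF0101">\n         Data which is want to fetch\n</font>';

// Lazy version: [\s\S]*? expands one character at a time until </font>
var lazy = new RegExp(
  '<font\\b[^<]*\\s+color="#FF0101"[^<]*>([\\s\\S]*?)</font>', 'ig');

// Unrolled version: consumes runs of non-"<" characters, and only steps over
// a "<" when it does not start "</font>"
var unrolled = new RegExp(
  '<font\\b[^<]*\\s+color="#FF0101"[^<]*>([^<]*(?:<(?!/font>)[^<]*)*)</font>', 'ig');

// Hypothetical helper: returns the trimmed first capture group of every match
function collectMatches(regex, text) {
  var results = [];
  var match;
  regex.lastIndex = 0; // start from the beginning for a global regex
  while ((match = regex.exec(text)) !== null) {
    results.push(match[1].replace(/^\s+|\s+$/g, ''));
  }
  return results;
}

var lazyResults = collectMatches(lazy, html);
var unrolledResults = collectMatches(unrolled, html);
```

Both arrays should contain the single string from the matching tag; the unrolled form mainly helps on long inputs, where the lazy quantifier does more step-by-step work.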

In case your HTML data spans across several paragraphs, use

function myFunction() {
  var doc = DocumentApp.getActiveDocument();
  var text = doc.getBody().getText();
  // Run the regex over the whole body so matches can span paragraphs
  var MyRegex = new RegExp('<font\\b[^<]*\\s+color="#FF0101"[^<]*>([\\s\\S]*?)</font>','ig');
  var match;
  while ((match = MyRegex.exec(text)) !== null)
  {
     Logger.log(match[1]);
  }
}

With this input:

<font color="#FF0202">NOT THIS ONE</font>
<font color="#FF0101">
         Data which is want to fetch
</font>

Result is:

[image: screenshot of the result]
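
The functions above read from a Google Document, whereas the question describes HTML already held in a string. A minimal sketch of a string-based variant follows; extractFontText is a hypothetical name, and the choices to accept single or double quotes and to take the color as a parameter are assumptions that go beyond the original answer. In Apps Script the string could come from, for example, UrlFetchApp.fetch(url).getContentText().

```javascript
// Hypothetical helper: returns the trimmed inner text of every <font> tag
// whose color attribute equals `color` (compared case-insensitively).
function extractFontText(html, color) {
  // Escape regex metacharacters so the color value is matched literally
  var escaped = color.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
  // (["']) captures the opening quote; \1 requires the same closing quote
  var regex = new RegExp(
    '<font\\b[^<>]*\\scolor=(["\'])' + escaped + '\\1[^<>]*>([\\s\\S]*?)</font>',
    'ig');
  var results = [];
  var match;
  while ((match = regex.exec(html)) !== null) {
    results.push(match[2].replace(/^\s+|\s+$/g, ''));
  }
  return results;
}

// Example call on the sample input from this answer
var sample = '<font color="#FF0202">NOT THIS ONE</font>\n' +
             '<font color="#FF0101">\n         Data which is want to fetch\n</font>';
var extracted = extractFontText(sample, '#FF0101');
```

The same caveat applies as before: this is still a regex over HTML, so unquoted attribute values, nested font tags, or commented-out markup can produce wrong results.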

Wiktor Stribiżew