Extract matching addresses using Google Sheets Re2 regular expression syntax

Question

I am trying to extract all cell/range addresses appear with in a formula in a Google Sheets cell.

Formulas by nature might be very complicated. I tried many patterns, that works in web testers, but not on google sheets re2.

The following example shows two issues. Maybe I misreading the matching results, but as I understand the are 4 Matches.

Formula (ignore the logic):

=A$13:B4+$BC$12+$DE2+F2:G2

Regex:

((\$?[A-Z]+\$?\d+)(:(\$?[A-Z]+\$?\d+))?)

Expected result:

[A$13:B4,$BC$12,$DE2,F2:G2]

Here (if I am not misreading the results) it looks OK. I am not sure if the group matching displayed are also considered as matches as it is stated "4 matches, 287 steps"

However in google sheets returns all Match 1 results

[A$13:B4,A$13,:B4,B4]

The other matches are ignored So I guess the question is how to convert the regex to re2 syntax?

Update: Following player0 comments, maybe I was not clear. This is only a simple example, to isolate other issues I have. This one is just a string containing addresses in few relative and absolute formats. However, I am looking for a wider general solution that will fit any possible formulas that might contain formulas and references to other sheets. For example:

=(STDEVA(Sheet1!B2:B5)+sum($A$1:$A$2))*B2

Expected results here is Sheet1!B2:B5,$A$1:$A$2,B2

This formula contains two formulas and reference to another sheet. Still ignoring here from Named Ranges and other formula possible references that I am currently can not think of. Also, the square brackets [] are irrelevant, it was just way to display the results, and actually is copied from Logs as it is all done within a script.

What if you don't use any capturing groups? `\$?[A-Z]+\$?\d+(?::(?:\$?[A-Z]+\$?\d+))?` https://regex101.com/r/A5yKb5/1 — The fourth bird, Aug 12 '20 at 07:00
@Thefourthbird As I am not a regex master, I don't fully understand the grouping usage, however I do understand it will prevent the group matching. Anyway, in google sheets still returns only the first match `A$13:B4` — OJNSim, Aug 12 '20 at 07:39

player0 · Answer 1 · 2020-08-13T12:52:54.657

2

try:

=INDEX(SUBSTITUTE(TEXTJOIN(",", 1, 
 IFNA(REGEXEXTRACT(SPLIT(SUBSTITUTE(FORMULATEXT(A3), "'", "♥"), 
 "+-*/^()=<>&"), 
 "(?:.+!)?[A-Z$]+\d+(?::[A-Z$](?:\d+)?)?|(?:.+!)?[A-Z$]:[A-Z$]+"))), "♥", "'"))

or longer:

=INDEX(SUBSTITUTE(TEXTJOIN(",", 1, 
 IFNA(IFNA(REGEXEXTRACT(SPLIT(SUBSTITUTE(FORMULATEXT(A3), "'", "♥"), 
 "+-*/^()=<>"), "(?:.+!)?[A-Z$]+\d+(?::[A-Z$](?:\d+)?)?"), 
 REGEXEXTRACT(SPLIT(SUBSTITUTE(FORMULATEXT(A3), "'", "♥"), 
 "+-*/^()=<>"), "(?:.+!)?[A-Z$]:[A-Z$]+")))), "♥", "'"))

edited Aug 13 '20 at 12:52

answered Aug 12 '20 at 10:03

player0

124,011
12
67
124

@OJNSim this should cover all your needs – player0 Aug 12 '20 at 10:33
I undesrstand what you did, I only don't understand why it solved the issue or better say, why there is an issue in the first place. The fact that only the first match is returned, is the way re2 behaves? – OJNSim Aug 13 '20 at 12:26
Also, I actually do it within a script, not with the internal functions. So your suggestion is to split the all formula text into an array, and loop over the latter to (try) match each entry? To clarify, the "strait forward" approach did not work neither with internal function nor on script (where I started). – OJNSim Aug 13 '20 at 12:27
Also, as I commented on Jan's answer, when removing the `/g` [flag](https://regex101.com/r/RWTD9A/1) it also returns the only the first match. So maybe _re2_ has such a flag, and this is actually the problem? – OJNSim Aug 13 '20 at 12:33
not sure whats your followup question about... regex will always return only a specified number of groups that's why it's needed to split the input because the number of matches are variable. regex101 is a nice tool but completely useless in conjunction with google sheets – player0 Aug 13 '20 at 12:40

score 0 · Answer 2 · answered Aug 12 '20 at 07:28

0

Seems, you can use

[A-Z$]+\d+(?::[A-Z$]\d+)?

See a demo on regex101.com.

answered Aug 12 '20 at 07:28

Jan

42,290
8
54
79

Although it has removed all the grouping matches, it is still return only the first match `A$13:B4`. I now realize that it is maybe the `\g` global flag in the demo. Is there a `re2` equivalent [flag](https://regex101.com/r/RWTD9A/1)? – OJNSim Aug 12 '20 at 07:55

score 0 · Accepted Answer · answered Aug 24 '20 at 08:58

I figured out a better way without splitting, by using the /g flag. However, this works in a script and not by using Sheets internal regex function (ie REGEXEXTRACT), as I couldn't figure out how to format the regex string within a cell that will contain the /g flag and REGEXEXTRACT will accept as a valid regex.

Here's the code:

/* Find all predessesor cells of input range 
*/
function findPredecessor(rng){
 
  var formualaText = rng.getFormula();
  
  /* addMatchesRegex
  * supports all A1Notation addresses 
  * the 2nd regex after the | operator will match all column addresses (A:A, Sheet2!b:B, etc)
  * some NamedRanges with names like NameRange1 
  * Does not support - NamedRange with names including dot, not ending with digits 
  */
  var addMatchesRegex = /(([\w .'!]+)?(\$?[A-Z]+\$?\d+\b)(:(\$?[A-Z]+\$?\d+))?)|([\w .'!]+)?[A-Z]+:[A-Z]+/gi; 
     
  var addMatches = formualaText.match(addMatchesRegex);
  
  Logger.log("%s add matched: %s",addMatches.length,addMatches);
  
  /* fullMatchRegex
  *  modify addMatches to return also strings like
  * 1. SUM, IFERROR, etc - internal sheets functions.
  * 2. NamedRanges
  * 
  */
  var fullMatchRegex = /(([\w .'!]+)?([\$A-Z.\d]*)(:(\$?[A-Z]+\$?\d*))?)/gi; 
  
  // match regex with formula
  var fullMatches =  formualaText.match(fullMatchRegex);
    
  Logger.log("Full matches list: %s",fullMatches);
  
  var namedRangesAdd = analyzeMatch(addMatches,fullMatches);
    
  Logger.log("%s total predecessors: %s",namedRangesAdd.length,namedRangesAdd);
}



/* This function accepts the two regex matches list
*  and returns one unique list of all predecessor addresses
*  @param {Array} addMatches - All A1 notation addresses 
*                              plus some of NamedRanges 
*  @param {Array} fullMatches - All A1 notation addresses,All NamedRanges,
*                               Other irrelevent matches
*/
function analyzeMatch(addMatches,fullMatches){

  /*Expected 
    First parameter - holds all A1Notation addresses as well as NamedRanges that
    their name in the form of /[A-Z]+/d+
    NamedRange with name including dot(.) or does not contain digits will not
    be on the list
    Second Parameter - contains all first list matches, as well as all NamedRanges
    names and also irrelevant matches to be filtered like function names and empty string 
  */
  
  //Full Matched Addresses to be returned
  var mAddresses = [];
  
  //Remove duplicate addresses
  var uniqueMatches = 
      addMatches.filter((item,index)=>addMatches.indexOf(item)===index); 
  
  //Get all named Ranges in spread sheet
  var nr = SpreadsheetApp.getActive().getNamedRanges();
  
  // Loop Named Ranges arr 
  nr.forEach(function(item){
  
    /* Check if the name of the current Named Range
    * is included in matches
    * 1. first in addMatches list
    * 2. only if not found in the wider list */
    
    var name = item.getName();
    
    //Check if in addmatches array
    var i = uniqueMatches.indexOf(name);
    
    //Build A1Notation address of current NamedRange 
    var rng = item.getRange();
    var add = "'" + rng.getSheet().getName() + "'!" + rng.getA1Notation();    
    
    if (i > -1){
      
      //Add the address of curr NamedRange to final list 
      mAddresses.push(add);
      //Remove curr NamedRange from list
      uniqueMatches.splice(i,1);
      
    }else if (fullMatches.includes(name)){
      // Name found - add the address of the 
      //              Named Range to matched Addresses list
      
      mAddresses.push(add);    
    }
    
  });
  
  //Add all left matched addresses to final list  
  mAddresses.push(...uniqueMatches);
  
  return mAddresses;
   
}

What makes it a bit complicated are NamedRanges. This code will match and analyze and returns one list with all predecessors addresses including the addresses of the NamedRanges.

Extract matching addresses using Google Sheets Re2 regular expression syntax

3 Answers3