23

I'm trying to get the list of all digits preceding a hyphen in a given string (let's say in cell A1), using a Google Sheets regex formula :

=REGEXEXTRACT(A1, "\d-")

My problem is that it only returns the first match... how can I get all matches?

Example text:

"A1-Nutrition;A2-ActPhysiq;A2-BioMeta;A2-Patho-jour;A2-StgMrktg2;H2-Bioth2/EtudeCas;H2-Bioth2/Gemmo;H2-Bioth2/Oligo;H2-Bioth2/Opo;H2-Bioth2/Organo;H3-Endocrino;H3-Génétiq"

My formula returns 1-, whereas I want to get 1-2-2-2-2-2-2-2-2-2-3-3- (either as an array or concatenated text).

I know I could use a script or another function (like SPLIT) to achieve the desired result, but what I really want to know is how I could get a re2 regular expression to return such multiple matches in a "REGEX.*" Google Sheets formula. Something like the "global - Don't return after first match" option on regex101.com

I've also tried removing the undesired text with REGEXREPLACE, with no success either (I couldn't get rid of other digits not preceding a hyphen).

Any help appreciated! Thanks :)

flo5783
  • 697
  • 2
  • 6
  • 19
  • 1
    I think you are otta luck on that. Documentation says only the first is returned... – dawg Apr 16 '17 at 00:19
  • 1
    @dawg Thanks, I agree, but I think there must be a way to define the right regex to get all matches; for instance if I use `((\d-))` I get the first two matches, with `(((\d-)))` I get the first three, but how to get them all, not knowing how many there are? Maybe using group name `\1`, allowing for any match repetition separated by `.*` ...or at least by combining multiple `REGEXEXTRACT` and/or `REGEXREPLACE` formulas. – flo5783 Apr 16 '17 at 01:08
  • 1
    I tried using `(?g)` which works for some re flavors but not re2 I am afraid. It is a great question. – dawg Apr 16 '17 at 04:13
  • 2
    I like the two answers using capture groups. Even simpler, =regexreplace(A1,"(\d-)|.","$1") seems to work. – Tom Sharpe Apr 20 '17 at 08:25
  • 3
    Yes I think `=regexreplace(A1,"(\d-)|.","$1") ` is the best one so far, thanks! Simple and efficient :) – flo5783 Apr 20 '17 at 17:29

8 Answers8

24

You can actually do this in a single formula using regexreplace to surround all the values with a capture group instead of replacing the text:

=join("",REGEXEXTRACT(A1,REGEXREPLACE(A1,"(\d-)","($1)")))

basically what it does is surround all instances of the \d- with a "capture group" then using regex extract, it neatly returns all the captures. if you want to join it back into a single string you can just use join to pack it back into a single cell:

enter image description here

Aurielle Perlmann
  • 5,323
  • 1
  • 15
  • 26
  • Can't get this to work. Second para fo regexextract expects a string, and the output of regexreplace doesn't match A1. – Pacerier Nov 13 '18 at 03:29
  • What are your inputs Pacerier – Aurielle Perlmann Nov 13 '18 at 03:33
  • This is brilliant. @Pacerier here's a tip: Do the RegexReplace in a cell by itself. Ensure that the output is exactly the original string except with parens () around the items you need to capture. Once that's true, then put the result into the RegexReplace as Aurielle mentioned. – JDG Jan 15 '19 at 04:54
  • Btw, if this is not working, then that probably means your string has some additional characters that is confusing the regex function, such as "[", "(", "?", "+", etc. Try stripping those out first! – 190290000 Ruble Man Jan 26 '21 at 22:25
  • Best answer so far! Any idea how to exclude the cells which don't match at all? With the current formula, if there is no match at all, the entire cell will be returned – Kostanos Dec 30 '21 at 16:49
  • This is very clever. It won't work sometimes though; for a solution that works on general inputs (if the text has characters like `(` or `?` or `+` etc), I've posted https://stackoverflow.com/a/74683624/711085 , which avoids a secondary parse with `REGEXEXTRACT` entirely. Very amusing though. – ninjagecko Dec 05 '22 at 05:23
  • This is great. I can make it work for nearly anything by assigning another cell as the regex input. For example, if I want to find if my cell contains any/all instances of words like foo, bar, hong, or dong, I can just ensure that the cell I reference in my copy of this formula contains (foo|bar|hong|dong). In this manner I can fill-down to use multiple rows doing multiple checks/extractions of my desired content. – colorful-shirts Jun 29 '23 at 06:06
11

You may create your own custom function in the Script Editor:

function ExtractAllRegex(input, pattern,groupId) {
  return [Array.from(input.matchAll(new RegExp(pattern,'g')), x=>x[groupId])];
}

Or, if you need to return all matches in a single cell joined with some separator:

function ExtractAllRegex(input, pattern,groupId,separator) {
  return Array.from(input.matchAll(new RegExp(pattern,'g')), x=>x[groupId]).join(separator);
}

Then, just call it like =ExtractAllRegex(A1, "\d-", 0, ", ").

Description:

  • input - current cell value
  • pattern - regex pattern
  • groupId - Capturing group ID you want to extract
  • separator - text used to join the matched results.
InSync
  • 4,851
  • 4
  • 8
  • 30
Wiktor Stribiżew
  • 607,720
  • 39
  • 448
  • 563
  • 1
    Thanks for the effort and clear answer, although I wrote in the OP that I was not looking for a script-based solution. – flo5783 Jul 03 '20 at 23:46
  • 1
    @flo5783 You wrote "I know I could use a script", but you did not provide this script. This is meant to show this alternative for those who want to follow this solution. Since `REGEXEXTRACT` does not support multiple matching, all those `REGEXREPLACE` based solutions are work-arounds, each time needing tweaks to adapt for each situation. This little function is a **generic solution** that makes up for the missing functionality. – Wiktor Stribiżew Jul 04 '20 at 10:18
  • 3
    Thank you, this did it. Even if the OP didn't want a script, this looks to have been the only 'real' solution to the problem for anyone else stumbling upon it. – degreesightdc Aug 21 '20 at 18:00
  • @WiktorStribiżew when trying to add your function to the script editor I get the following error: ```TypeError: Cannot read property 'matchAll' of undefined (line 2, file "Code")Dismiss``` How do you think this could be fixed? Thanks! – linguist_at_large Oct 22 '20 at 12:59
  • Thanks again @WiktorStribiżew but now I'm getting ```SyntaxError: Unexpected token ')' (line 4, file "Code.gs")``` – linguist_at_large Oct 22 '20 at 13:19
  • 1
    @linguist_at_large `matchAll` is not supported in old JS environments, so you [can use](https://jsfiddle.net/wiktor_stribizew/eughbf5L/2/) `function ExtractAllRegex(input, pattern,groupId,separator) { var m, results = []; var rx = new RegExp(pattern,'g'); while (m=rx.exec(input)) { results.push(m[groupId]); } return results.join(separator); }` – Wiktor Stribiżew Oct 22 '20 at 13:24
7

Edit

I came up with more general solution:

=regexreplace(A1,"(.)?(\d-)|(.)","$2")

It replaces any text except the second group match (\d-) with just the second group $2.

"(.)?(\d-)|(.)"
  1    2    3  
  Groups are in ()
  ---------------------------------------
 "$2" -- means return the group number 2

Learn regular expressions: https://regexone.com


Try this formula:

=regexreplace(regexreplace(A1,"[^\-0-9]",""),"(\d-)|(.)","$1")

It will handle string like this:

"A1-Nutrition;A2-ActPhysiq;A2-BioM---eta;A2-PH3-Généti***566*9q"

with output:

1-2-2-2-3-

Max Makhrov
  • 17,309
  • 5
  • 55
  • 81
  • Can someone explain this formula? – Fabian Jul 21 '20 at 18:22
  • @Fabian I've added a short description. To learn it in depth I recommend studying with regular expressions. – Max Makhrov Jul 23 '20 at 08:03
  • Thank you @Max Makhrov! – Fabian Jul 23 '20 at 09:31
  • 2
    Why capture group 1 and 3? Shorter: `=regexreplace(A1,".?(\d-)|.", "$1")` – Peter Thoeny Feb 08 '21 at 21:39
  • @Peter Thoeny please consider making that an answer for ease of access as it's the simplest and clearest one. Also, what does the `|.` stands for? Thanks for sharing! – Lod Feb 15 '23 at 16:22
  • @MaxMakhrov I've adapted your great formula to this example: https://docs.google.com/spreadsheets/d/1EHKcmhEqIN_glWt_I1G1lXhlAMpPvQWzUcm8suFT-OM/edit?usp=sharing How to get it to return the 1st letter in `A19`? My current formula in `A19:A20` : `=trim(regexreplace(regexreplace(C17,".?([A-Z]{2,})|.", " $1"), "(\s)([A-Z])","$1 $2"))` Thanks! – Lod Feb 15 '23 at 19:54
  • That seems to work: `=trim(regexreplace(regexreplace(C23,"(?:([A-Z]{2,}))|.", " $1"), "(\s)([A-Z])","$1 $2"))` I'll test further. Thanks again. – Lod Feb 15 '23 at 20:03
  • @Lod: I added my answer with explanation: https://stackoverflow.com/a/75465316/7475450 – Peter Thoeny Feb 15 '23 at 20:51
5

I wasn't able to get the accepted answer to work for my case. I'd like to do it that way, but needed a quick solution and went with the following:

Input:

1111 days, 123 hours 1234 minutes and 121 seconds

Expected output:

1111 123 1234 121

Formula:

=split(REGEXREPLACE(C26,"[a-z,]"," ")," ")
  • 1
    Different problem, different solution. Your case is simpler. In my case I had other digits that were to be dismissed when not preceding a hyphen. – flo5783 Oct 14 '18 at 19:38
4

The shortest possible regex:

=regexreplace(A1,".?(\d-)|.", "$1")

Which returns 1-2-2-2-2-2-2-2-2-2-3-3- for "A1-Nutrition;A2-ActPhysiq;A2-BioMeta;A2-Patho-jour;A2-StgMrktg2;H2-Bioth2/EtudeCas;H2-Bioth2/Gemmo;H2-Bioth2/Oligo;H2-Bioth2/Opo;H2-Bioth2/Organo;H3-Endocrino;H3-Génétiq".

Explanation of regex:

  • .? -- optional character
  • (\d-) -- capture group 1 with a digit followed by a dash (specify (\d+-) multiple digits)
  • | -- logical or
  • . -- any character
  • the replacement "$1" uses just the capture group 1, and discards anything else

Learn more about regex: https://twiki.org/cgi-bin/view/Codev/TWikiPresentation2018x10x14Regex

Peter Thoeny
  • 7,379
  • 1
  • 10
  • 20
  • Great help. Thanks to your comment I could finally come up with this one `https://stackoverflow.com/a/75452891/10789707` for my use case after half a day searching. Be well! – Lod Feb 15 '23 at 20:56
  • 1
    5 years too late for my project, but that's the kind of answer that's useful for life. Thanks! :-) – flo5783 Feb 16 '23 at 14:12
2

This seems to work and I have tried to verify it.

The logic is

(1) Replace letter followed by hyphen with nothing

(2) Replace any digit not followed by a hyphen with nothing

(3) Replace everything which is not a digit or hyphen with nothing

=regexreplace(A1,"[a-zA-Z]-|[0-9][^-]|[a-zA-Z;/é]","")

Result

1-2-2-2-2-2-2-2-2-2-3-3-

Analysis

I had to step through these procedurally to convince myself that this was correct. According to this reference when there are alternatives separated by the pipe symbol, regex should match them in order left-to-right. The above formula doesn't work properly unless rule 1 comes first (otherwise it reduces all characters except a digit or hyphen to null before rule (1) can come into play and you get an extra hyphen from "Patho-jour").

Here are some examples of how I think it must deal with the text

enter image description here

Tom Sharpe
  • 30,727
  • 5
  • 24
  • 37
  • Thanks, I still wish there was a more elegant or generic solution to fit all matches in re2, but this really does the trick in my case and I appreciate the detailed explanation and research very much! – flo5783 Apr 17 '17 at 15:46
  • Thanks, much appreciated. – Tom Sharpe Apr 17 '17 at 16:07
1

There are two general ('Excel' / 'native' / non-Apps Script) solutions to return an array of regex matches in the style of REGEXEXTRACT:

Method 1)

insert a delimiter around matches, remove junk, and call SPLIT

Regexes work by iterating over the string from left to right, and 'consuming'. If we are careful to consume junk values, we can throw them away.

(This gets around the problem faced by the currently accepted solution, which is that as Carlos Eduardo Oliveira mentions, it will obviously fail if the corpus text contains special regex characters.)

First we pick a delimiter, which must not already exist in the text. The proper way to do this is to parse the text to temporarily replace our delimiter with a "temporary delimiter", like if we were going to use commas "," we'd first replace all existing commas with something like "<<QUOTED-COMMA>>" then un-replace them later. BUT, for simplicity's sake, we'll just grab a random character such as from the private-use unicode blocks and use it as our special delimiter (note that it is 2 bytes... google spreadsheets might not count bytes in graphemes in a consistent way, but we'll be careful later).

=SPLIT(
  LAMBDA(temp,
    MID(temp, 1, LEN(temp)-LEN(""))
  )(
    REGEXREPLACE(
      "xyzSixSpaces:[      ]123ThreeSpaces:[   ]aaaa 12345",".*?(   |$)",
      "$1"
    )
  ),
  ""
)

We just use a lambda to define temp="match1match2match3", then use that to remove the last delimiter into "match1match2match3", then SPLIT it.

Taking COLUMNS of the result will prove that the correct result is returned, i.e. {" ", " ", " "}.

This is a particularly good function to turn into a Named Function, and call it something like REGEXGLOBALEXTRACT(text,regex) or REGEXALLEXTRACT(text,regex), e.g.:

=SPLIT(
  LAMBDA(temp,
    MID(temp, 1, LEN(temp)-LEN(""))
  )(
    REGEXREPLACE(
      text, 
      ".*?("&regex&"|$)", 
      "$1"
    )
  ),
  ""
)

Method 2)

use recursion

With LAMBDA (i.e. lets you define a function like any other programming language), you can use some tricks from the well-studied lambda calculus and function programming: you have access to recursion. Defining a recursive function is confusing because there's no easy way for it to refer to itself, so you have to use a trick/convention:

trick for recursive functions: to actually define a function f which needs to refer to itself, instead define a function that takes a parameter of itself and returns the function you actually want; pass in this 'convention' to the Y-combinator to turn it into an actual recursive function

The plumbing which takes such a function work is called the Y-combinator. Here is a good article to understand it if you have some programming background.

For example to get the result of 5! (5 factorial, i.e. implement our own FACT(5)), we could define:

Named Function Y(f)=LAMBDA(f, (LAMBDA(x,x(x)))( LAMBDA(x, f(LAMBDA(y, x(x)(y)))) ) ) (this is the Y-combinator and is magic; you don't have to understand it to use it)

Named Function MY_FACTORIAL(n)=

Y(LAMBDA(self,

  LAMBDA(n, 
    IF(n=0, 1, n*self(n-1))
  )

))

result of MY_FACTORIAL(5): 120

The Y-combinator makes writing recursive functions look relatively easy, like an introduction to programming class. I'm using Named Functions for clarity, but you could just dump it all together at the expense of sanity...

=LAMBDA(Y,
  Y(LAMBDA(self, LAMBDA(n, IF(n=0,1,n*self(n-1))) ))(5)
)(
  LAMBDA(f, (LAMBDA(x,x(x)))( LAMBDA(x, f(LAMBDA(y, x(x)(y)))) ) )
)

How does this apply to the problem at hand? Well a recursive solution is as follows:

in pseudocode below, I use 'function' instead of LAMBDA, but it's the same thing:

// code to get around the fact that you can't have 0-length arrays
function emptyList() {
  return {"ignore this value"}
}
function listToArray(myList) {
  return OFFSET(myList,0,1)
}

function allMatches(text, regex) {
  allMatchesHelper(emptyList(), text, regex)
}

function allMatchesHelper(resultsToReturn, text, regex) {
  currentMatch = REGEXEXTRACT(...)
  if (currentMatch succeeds) {
    textWithoutMatch = SUBSTITUTE(text, currentMatch, "", 1)
    return allMatches(
      {resultsToReturn,currentMatch}, 
      textWithoutMatch, 
      regex
    )
  } else {
    return listToArray(resultsToReturn)
  }
}

Unfortunately, the recursive approach is quadratic order of growth (because it's appending the results over and over to itself, while recreating the giant search string with smaller and smaller bites taken out of it, so 1+2+3+4+5+... = big^2, which can add up to a lot of time), so may be slow if you have many many matches. It's better to stay inside the regex engine for speed, since it's probably highly optimized.

You could of course avoid using Named Functions by doing temporary bindings with LAMBDA(varName, expr)(varValue) if you want to use varName in an expression. (You can define this pattern as a Named Function =cont(varValue) to invert the order of the parameters to keep code cleaner, or not.)

  • Whenever I use varName = varValue, write that instead.
  • to see if a match succeeds, use ISNA(...)

It would look something like:

Named Function allMatches(resultsToReturn, text, regex): UNTESTED:

LAMBDA(helper,
  OFFSET(
    helper({"ignore"}, text, regex),
  0,1)
)(

  Y(LAMBDA(helperItself,

  LAMBDA(results, partialText,
    LAMBDA(currentMatch,
      IF(ISNA(currentMatch),
        results,
        LAMBDA(textWithoutMatch,
          helperItself({results,currentMatch}, textWithoutMatch)
        )(
          SUBSTITUTE(partialText, currentMatch, "", 1)
        )
      )
    )(
      REGEXEXTRACT(partialText, regex)
    )
  )

  ))
)
ninjagecko
  • 88,546
  • 24
  • 137
  • 145
0

The solution to capture groups with RegexReplace and then do the RegexExctract works here too, but there is a catch.

=join("",REGEXEXTRACT(A1,REGEXREPLACE(A1,"(\d-)","($1)")))

If the cell that you are trying to get the values has Special Characters like parentheses "(" or question mark "?" the solution provided won´t work.

In my case, I was trying to list all “variables text” contained in the cell. Those “variables text “ was wrote inside like that: “{example_name}”. But the full content of the cell had special characters making the regex formula do break. When I removed theses specials characters, then I could list all captured groups like the solution did.

  • This sounds like commentary on the accepted answer? – Jeremy Caney Jul 16 '22 at 19:56
  • This does not provide an answer to the question. Once you have sufficient [reputation](https://stackoverflow.com/help/whats-reputation) you will be able to [comment on any post](https://stackoverflow.com/help/privileges/comment); instead, [provide answers that don't require clarification from the asker](https://meta.stackexchange.com/questions/214173/why-do-i-need-50-reputation-to-comment-what-can-i-do-instead). - [From Review](/review/late-answers/32222941) – Rabinzel Jul 17 '22 at 05:19
  • I have addressed this concern in a new answer https://stackoverflow.com/a/74683624/711085 . No reply is necessary, but I thought I'd comment here in case it's helpful in the future. Cheers. – ninjagecko Dec 05 '22 at 05:16