1

Question: looking for Google Sheets regex that captures all instances of a string between [t- ] and outputs to the neighboring column cell as an array, or some other delimiter between matches.

For the following string, I'm trying to extract all instances of text between [t- ].

A1:

Lorem Ipsum [t- is simply] dummy text of the [t- printing] and typesetting [c- industry], so [d- it would make] sense that

Expected Output is an array of all occurrences in a single column:

B1:

[is simply, printing]

Or output could be any delimiter of match occurrences

is simply | printing

Trying the following with a single text within [t- ] works fine, but for multiple instances it extracts everything between open [t- of first occurrence and ] of last occurrence:

=REGEXEXTRACT(A1,"\[t- (.*)\]")

Leading to:

is simply]! dummy text of the [t- printing

I've also tried multiple capture groups, but this only works if I'm sure there's only two instances of text between [t- ]— there could be n instances per row. Also it doesn't output results to an arrary in one column, but spreads across multiple columns:

=regexextract(A1, "(\[t- (.*)\]).*(\[t- (.*)\])" )


EDIT: I've received a couple answers with Regex that works for other tools/languages (e.g., PHP or Javascript) but not Google Sheets. Here's the Google Sheets Regex Syntax.

EDIT 2: The above sample string has other text inside brackets marked with other letters, e.g., [c- industry] and [d- it would make]. These should not be included. Only texted in [t- ] (with a "t-") should be returned.

TheMaster
  • 45,448
  • 6
  • 62
  • 85
user3871
  • 12,432
  • 33
  • 128
  • 268
  • Try `\[t-([^]]*)\]`. You can check and get an explanation here https://regex101.com/r/iemQDb/1. – Andrei Odegov Feb 03 '19 at 19:12
  • @AndreiOdegov this only gets the first occurrence – user3871 Feb 03 '19 at 19:13
  • 1
    what about REGEXREPLACE and the `(?:^|\])[^\[]*(?:\[t-|$)` regex? check here https://regex101.com/r/WwFjcy/1. – Andrei Odegov Feb 03 '19 at 19:37
  • @AndreiOdegov this works better, but this also gets text inside brackets with other characters, e.g., `[c- ]`, or `[d- ]`. It needs to only get texts within `[t- ]` and ignore replacing text when there's other brackets (like `[c- ]` or `[d- ]`) – user3871 Feb 03 '19 at 19:52
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/187816/discussion-between-growler-and-andrei-odegov). – user3871 Feb 03 '19 at 20:41
  • Who is the author of the "Google Sheets Regex Syntax"? At the bottom of it there is link to a blank page. – Rubén Feb 03 '19 at 20:52
  • `@Growler`, all the same REGEXREPLACE, but with new regex `^.*?\[t-|(?<=\[t-).*?\K\].*?(?:\[t-|$)`. Check it here https://regex101.com/r/WwFjcy/2. – Andrei Odegov Feb 03 '19 at 21:34
  • @AndreiOdegov thanks but are you not testing in google sheets? The syntax is invalid. – user3871 Feb 03 '19 at 21:38
  • @Onyambu what do you mean? That doesn't look like valid syntax for Google Sheets. How would it look in the entire formula? – user3871 Feb 03 '19 at 22:23

2 Answers2

7

Similar to the technique used here,

  • REGEXREPLACE all [t-.*] to (.*)
  • Provide the resulting expression from above as the regex to REGEXEXTRACT
  • \Q..\E is used to escape other characters

    =REGEXEXTRACT(A1, "\Q"&REGEXREPLACE(A1,"\[t-[^]]+\]","\\E(.*)\\Q")&"\E")
    

Alternatively, With just REGEXREPLACE to delimit,

=REGEXREPLACE(A1,"(^|\])(.*?)(\[t-|$)","$1|$3")
=SPLIT(REGEXREPLACE(A1,"(^|\])(.*?)(\[t-|$)","$1|$3"),"|")

Replace all characters .* that

  • Start with ] or start of string ^
  • End with [t- or end of string $

with ]|[t-

TheMaster
  • 45,448
  • 6
  • 62
  • 85
  • These are great - they work for all cases except when two `[t- ]` are next to each other with space and without space. E.g., when string is `[t- hello] [t- world]`. You end up with `[t- hello] [t-` in one column cell as match, and `world]` in the next. When string is `[t- hello][t- world]` (no space) you get `[t- hello][t- world]` as single match in cell – user3871 Feb 03 '19 at 22:46
  • @Growler The second regex works fine in both cases and the output is exactly the same as first with `split`. The first one didn't work because of the greedy `(.*)`. If we change it to `"\\E(\\Q$0\\E)\\Q"`, it works. – TheMaster Feb 04 '19 at 07:41
  • Thank you! This works great. It's really too bad `regexextract` doesn't have a greedy option to extract all instances – user3871 Feb 04 '19 at 15:25
  • 1
    @Growler, Compared to excel, having regex in spreadsheet itself is a boon. – TheMaster Feb 04 '19 at 16:46
3

bulletproof solution:

="["&JOIN(", ", ARRAYFORMULA(REGEXREPLACE(
 QUERY(SPLIT(TRANSPOSE(SPLIT(A1, "[")), "]"), 
 "select Col1 where Col1 contains 't- '", 0), 
 "t- ", "")))&"]"

__________________________________________________________

=JOIN(" | ", ARRAYFORMULA(REGEXREPLACE(
 QUERY(SPLIT(TRANSPOSE(SPLIT(A1, "[")), "]"), 
 "select Col1 where Col1 contains 't- '", 0), 
 "t- ", "")))

player0
  • 124,011
  • 12
  • 67
  • 124