1

Can anyone help me with a problem I am encountering? Pretty familiar with Google Spreadsheet but not regex.

I have a function

=REGEXEXTRACT(A1, ".*\/(\w+)\/(\w+)\/")

so for arguments sake, A1 =

google.com/structure1/structure2/structure3/page.php

The it spits out into B1 'Structure2' and C1 'Structure3'.

Now ideally what I want is 'Structure3' in B1 and 'Structure2' in C1.

I know I can just make D1 = B1 but I want to figure it out.

The regex was generated from google and SO searches which I do not fully understand.

What I am trying to achieve is inputting a series or URLs in Column A and then getting it hierarchy basically.

The above regex does not work if I have

google.com/page.php

Any help appreciated.

Edit: I'm guessing a script would have to be make for the last problem but am struggling figuring out how to tackle it?

neminem
  • 2,658
  • 5
  • 27
  • 36
  • Your regex won't work on `google.com/page.php` because as you've written it, it's looking for three forward slashes, with captures on all the components between the first and last. You'll either need to make a less explicit regex for your cell contents, or construct one that can operate on variable groups. This might help: http://stackoverflow.com/questions/5018487/regular-expression-with-variable-number-of-groups – Ryan J Feb 15 '14 at 00:35
  • Ok so I read up on regex and think I have a better understanding now thanks – user-userID-unknown Feb 17 '14 at 04:09

2 Answers2

0

REGEXEXTRACT is a built-in function which takes two arguments, the input value location and the regex. It then places the captured groups (portions within parentheses) in the next cells, in order. You can't change that behavior because you can't edit the code of the built-in function. D1 = B1 is your best bet if you want structure3 before structure2.

That regex does not match google.com/page.php. What is your desired behavior? If you want it to always match, try making the extra structures in the URL optional, like so:

".*(?:\/(\w+))?(?:\/(\w+)\/)?"

(?:foo) is a non-capturing group (i.e. it won't be put in B1). (?:bar)? makes the non-capturing group optional.

Patrick Fisher
  • 7,926
  • 5
  • 35
  • 28
  • Ideally I would like to write a script which would make use of Split and then regex to capture the last 2 groups in a URL. From there then use an organizational chart to view a hierarchy. I'm suprised this has not already been done (that I know of). Will try a few different things perhaps using string.split() and then update the thread – user-userID-unknown Feb 17 '14 at 04:11
0

I came up with an alternative solution usng custom functions.

It is is no way elegant or optimised but does the job for now

function parent(w) {
var string = w.split("/");
var components = w.split("/").length;
var parent = components-2;
var child = string[parent];
return (child);
}

function child(w) {
var string = w.split("/");
var components = w.split("/").length;
var parent = components-1;
var child = string[parent];
return (child);
}

Wrap the cell in array, count the array, then pull the last component and second last part of the string.