This is a follow-up question to "Lazy (ungreedy) matching multiple groups using regex". I tried the method from that question, but without much success.

I grab a string from the GitLab API and try to extract all the repos. Each repo URL follows the format "https://gitlab.example.com/foo/xxx.git".

So far, matching just the fixed prefix works OK:

gitlab_str.scan(/\"https\:\/\/gitlab\.example\.com\/foo\//)

But adding a wildcard for the repo name is tricky. I used the method from the previous question:

gitlab_str.scan(/\"https\:\/\/gitlab\.example\.com\/foo\/(.*?)\.git\"/)

It says to use (.*?) for lazy matching, but it doesn't seem to work.

Thanks a lot for the help.

user180574

1 Answer

If we have the following string:

gitlab_str = "\"https://gitlab.example.com/foo/xxx.git\""

The following RegEx will return [["xxx"]], which is expected:

gitlab_str.scan(/\"https\:\/\/gitlab\.example\.com\/foo\/(.*?)\.git\"/)

That is because of the parentheses in (.*?). When the pattern contains a capture group, scan returns only what the group captured, not the whole match. If you want the whole matched string, just remove the parentheses:

gitlab_str.scan(/\"https\:\/\/gitlab\.example\.com\/foo\/.*?\.git\"/)

This will return:

["\"https://gitlab.example.com/foo/xxx.git\""]

It also works for multiple occurrences:

> gitlab_str = "\"https://gitlab.example.com/foo/xxx.git\" and \"https://gitlab.example.com/foo/yyy.git\""
> gitlab_str.scan(/\"https\:\/\/gitlab\.example\.com\/foo\/.*?\.git\"/)

=> ["\"https://gitlab.example.com/foo/xxx.git\"", "\"https://gitlab.example.com/foo/yyy.git\""]

Finally, if you want to remove the https:// part from the resulting matches, then just wrap everything but that part with () in the RegEx:

gitlab_str.scan(/\"https\:\/\/(gitlab\.example\.com\/foo\/.*?\.git)\"/)
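
For reference, here is a small sketch of what that last pattern returns, using the same two-URL sample string as above. The variable names `nested` and `repos` are only illustrative. With a capture group, scan gives one single-element array per match, so `flatten` is a convenient way to get a plain list of strings:

```ruby
# Same two-URL sample as above, written with single quotes to avoid escaping.
gitlab_str = '"https://gitlab.example.com/foo/xxx.git" and "https://gitlab.example.com/foo/yyy.git"'

# With one capture group, scan returns one single-element array per match.
nested = gitlab_str.scan(%r{"https://(gitlab\.example\.com/foo/.*?\.git)"})
p nested  # [["gitlab.example.com/foo/xxx.git"], ["gitlab.example.com/foo/yyy.git"]]

# flatten turns the nested result into a plain array of strings.
repos = nested.flatten
p repos   # ["gitlab.example.com/foo/xxx.git", "gitlab.example.com/foo/yyy.git"]
```

The `%r{...}` literal is the same kind of regex as `/.../`; it just saves escaping every forward slash.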
Toribio
  • I see, previously I thought it doesn't work because it would also match things like "https://gitlab.example.com/foo/xxx, name:"...", path: "...".........git". In other words, there is garbage between "xxx" and ".git". So instead of allowing (.*), I should restrict the alphabet set. Thanks! – user180574 Jun 02 '17 at 00:14
  • If the URL has cases where it doesn't end with `.git"` then this RegEX will be a problem, so you'd need a more sophisticated matching, like limiting the alphabet instead of using a wildcard, or expecting a `"` before a `.git`, etc... – Toribio Jun 02 '17 at 00:19
  • Thanks, for this case, I replace .*? with [^,]+ because comma should not appear in repo name, which works pretty good. – user180574 Jun 02 '17 at 18:44
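
To illustrate the point made in the comments, here is a minimal sketch of how the lazy wildcard can still run across unrelated fields, and how restricting the character set avoids it. The sample string and its field names (`web_url`, `name`, `http_url`) are made up for the example and are not the real GitLab API response. The class `[^",]+` is one possible restriction, assuming repo names never contain a quote or a comma:

```ruby
# Hypothetical input: the first URL does not end in .git, the second one does.
gitlab_str = '{"web_url": "https://gitlab.example.com/foo/xxx", "name": "xxx", ' \
             '"http_url": "https://gitlab.example.com/foo/yyy.git"}'

# Lazy wildcard: starts at the first URL and keeps going until the first .git"
# it can find, so the capture swallows everything in between.
lazy = gitlab_str.scan(%r{"https://gitlab\.example\.com/foo/(.*?)\.git"})
p lazy.first.first
# xxx", "name": "xxx", "http_url": "https://gitlab.example.com/foo/yyy

# Restricted character class: the name may not contain a quote or a comma,
# so the first URL simply fails to match and only the real repo is returned.
strict = gitlab_str.scan(%r{"https://gitlab\.example\.com/foo/([^",]+)\.git"})
p strict  # [["yyy"]]
```

Lazy matching only makes the wildcard stop at the earliest possible `.git"`; it does not stop it from crossing field boundaries, which is why the character class is the more robust choice here.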