I'm trying to capture all instances of text between < and > symbols, as well as the beginning word.

test <1> <example> <dynamic>

I tried the following expression:

(\w+) (?:<(.*?)>)+

but it doesn't work. There can be 1 or more groups that I have to capture. What I expect this to capture is (groups):

  • test
  • 1
  • example
  • dynamic

but all I get is:

  • test
  • 1

Can anyone help me figure out how to do this properly? Thanks a lot.
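
The behavior described above can be reproduced outside PCRE. The following is a minimal sketch using Python's built-in `re` module (an assumption; the question itself uses a PCRE library), with the input taken to be `test <1> <example> <dynamic>`. It suggests two separate causes: the pattern allows no whitespace between consecutive `<...>` items, and a repeated capture group keeps only the text of its last iteration.

```python
import re

pattern = re.compile(r"(\w+) (?:<(.*?)>)+")

# With spaces between the items, the repetition stops after "<1>",
# because the next character is a space rather than "<".
spaced = pattern.search("test <1> <example> <dynamic>")
print(spaced.group(0), spaced.groups())   # test <1> ('test', '1')

# Without spaces the repetition runs three times, but group 2 only
# remembers the text from its final iteration.
packed = pattern.search("test <1><example><dynamic>")
print(packed.group(0), packed.groups())   # test <1><example><dynamic> ('test', 'dynamic')
```

In both cases the number of groups is fixed by the pattern, not by the input, so a single match cannot return one group per parameter.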

NS studios
  • You can't have a dynamic amount of groups in the pattern. – Wiktor Stribiżew May 14 '20 at 15:20
  • You could have 2 groups `(?:(\w+)|\G(?!^))\h+<([^<>]+)>` https://regex101.com/r/GTJLv9/1 – The fourth bird May 14 '20 at 15:21
  • @The fourth bird: Thanks so much. can you explain how this works? I've never seen \h or \G before, for example. – NS studios May 14 '20 at 15:31
  • Are you trying to parse HTML? If so, then that's a different problem entirely. – Andy Lester May 14 '20 at 15:45
  • Nope; just trying to parse commands for my program that are in the format name etc. – NS studios May 14 '20 at 16:39
  • Is there to be a match if the string were `"<1> "` or `"<1> test"`? If so, what would be the "first word"? An empty string, perhaps? Would it be sufficient to return an array of matches (as opposed to captures) of the first word (first element) and strings wrapped in `"<>"` (remaining elements)? – Cary Swoveland May 14 '20 at 16:42
  • No, the format must be name optional optional etc. Not sure I understand what you mean by arrays instead of captures. I'm using a PCRE library... – NS studios May 14 '20 at 17:00
  • Suppose the string were `"catdog"`. Then the regular expression `^\w+|(?<=<)\w+(?=>)` would match `"cat"`, `"polo"` and `"pony"`. [PCRE Demo](https://regex101.com/r/26TDme/1/). (There are no captures because the regex does not have a capture group.) With whatever programming language you are using you can easily return these matches in an array, `["cat", "polo", "pony"]`, the first element of which is always the first word in the string. I'm asking if that would meet your needs. P.S. Don't forget to include the intended recipient's user name in your comments, so SO will inform them. – Cary Swoveland May 14 '20 at 17:46
  • I didn't notice the requirement for spaces. Please disregard my comment above. – Cary Swoveland May 14 '20 at 18:39
  • @Cary Swoveland this works fine for a single line of commands, but not if I have a file with other stuff that I'm trying to extract those from, e.g., it captures just random words apart from the stuff in < >. I need capture to start with name – NS studios May 14 '20 at 20:28

1 Answer

Using PCRE you could have 2 groups, where the first group will match test from the start of the string and the second group will contain the values between the angle brackets.

The \G anchor matches either at the start of the string or at the position where the previous match ended.

At the start of the string, you will match 1+ word characters.

(?:(\w+)|\G(?!^))\h+<([^<>]+)>

Regex demo

Explanation

  • (?: Non capture group
    • (\w+) Capture group 1, match 1+ word chars
    • | Or
    • \G(?!^) Assert position at the end of the previous match, not at the start of the string
  • ) Close group
  • \h+ Match 1+ horizontal whitespace chars
  • <([^<>]+)> Match <, capture in group 2 one or more chars other than < or >, then match >
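
For readers without a PCRE engine at hand, the continuation idea can be approximated in other environments. The sketch below uses Python's built-in `re` module, which supports neither `\G` nor `\h`, so this is an emulation under stated assumptions rather than the pattern above: `Pattern.match(text, pos)` stands in for `\G` (it only matches starting exactly at `pos`), `[ \t]` stands in for `\h`, and the helper name `parse_command` is hypothetical.

```python
import re

# Assumption: these two patterns split the PCRE pattern into its two halves.
NAME = re.compile(r"\w+")                 # plays the role of (\w+)
PARAM = re.compile(r"[ \t]+<([^<>]+)>")   # plays the role of \h+<([^<>]+)>


def parse_command(text):
    """Return (name, [params]) for the first 'name <a> <b> ...' found, else None."""
    for name in NAME.finditer(text):
        params = []
        pos = name.end()
        while True:
            # Like \G: the next parameter must begin where the last match ended.
            m = PARAM.match(text, pos)
            if m is None:
                break
            params.append(m.group(1))
            pos = m.end()
        if params:
            return name.group(), params
    return None


print(parse_command("test <1> <example> <dynamic>"))
# ('test', ['1', 'example', 'dynamic'])
```

Because each parameter has to start where the previous one ended, a stray `<...>` later in the text is not attached to the command, which is the same guarantee `\G(?!^)` gives in the PCRE pattern.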
The fourth bird
  • Thanks very much. One more thing: how would this work if I wanted to capture just stuff between < and > and not the first word? Also, is there any way to prevent blank groups from being captured for repetitions? the way it prints here is: test, new line, 1, new line, new line, example, new line, new line, dynamic. – NS studios May 14 '20 at 17:51
  • @CarySwoveland Should there be a match there? You could match 0+ horizontal whitespace chars `(?:(\w+)|\G(?!^))\h*<([^<>]+)>` https://regex101.com/r/VZKFs9/1 – The fourth bird May 14 '20 at 17:53
  • @NSstudios If you want a single capturing group for the values between <> you could use `(?:\w+|\G(?!^))\h+<([^<>]+)>` https://regex101.com/r/BNqw4n/1 – The fourth bird May 14 '20 at 17:55
  • @NSstudios If you don't want to match newlines, you can exclude them from the character class `(?:(\w+)|\G(?!^))\h+<([^<>\r\n]+)>` https://regex101.com/r/eBwTcl/1 – The fourth bird May 14 '20 at 17:57
  • I see. Spaces are required. – Cary Swoveland May 14 '20 at 18:00
  • A possible hiccup. In a comment on the question the OP says, "No, the format must be `name optional ...`". See my first example [here](https://regex101.com/r/GM0WYx/4/). – Cary Swoveland May 14 '20 at 18:38
  • @CarySwoveland Ah I see, then you can pin the word chars to the start of the string https://regex101.com/r/VlWPwp/1 `(?:^(\w+)|\G(?!^))\h+<([^<>\r\n]+)>` – The fourth bird May 14 '20 at 18:47
  • ...but what about "pony", "mouse" and "rabbit" in the second example? They don't know why they've been left out. – Cary Swoveland May 14 '20 at 19:03
  • They should ask the caret...:-) – The fourth bird May 14 '20 at 19:06
  • @The fourth bird totally forgot I could use non-capturing group to make it capture only parameters, and \r\n to stop newlines, thanks. That still doesn't stop an empty group from being created, though, but it may just be me. Is there maybe a way all of the parameters could be under a single match? Right now I get separate matches for 2nd and above parameters. [ { "g": [ "test", "1" ], "m": "test <1>" }, { "g": [ "", "example" ], "m": " " }, { "g": [ "", "dynamic" ], "m": " " } ] – NS studios May 14 '20 at 21:09
  • If you don't want to match an empty group `(?:(\w+)|\G(?!^))\h+<(\h*[^\s<>][^<>\r\n]*)>` https://regex101.com/r/dQKTso/1 Or all the parameters in a single group https://regex101.com/r/G3yk2g/1 – The fourth bird May 14 '20 at 21:17
  • @The fourth bird Thanks, but the first example still does empty groups, and I would prefer if the second one did a single match, not a single group (if possible). Here's an example of the ideal result: [ { "groups": [ "test", "1", "example", "dynamic" ], "match": "test <1> " }, ] – NS studios May 15 '20 at 00:11
  • @NSstudios I don't think you can do that. You could use this pattern to match word characters between `<>` and not match empty values in group 2 `(?:^(\w+)|\G(?!^))\h+<(\w+)>` https://regex101.com/r/zi3cb3/1 The other option is to get a single match and split group 2 on a space https://regex101.com/r/7FYCTo/1 – The fourth bird May 15 '20 at 11:50
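
The "single match, then split group 2" option from the last comment can be sketched as follows. This is an illustration in Python's built-in `re` module, not the pattern behind the linked demo (which is not reproduced here): the `COMMAND` pattern, the `[ \t]` stand-in for `\h`, the sample input, and the helper name `parse_commands` are all assumptions.

```python
import re

# One match per command line: group 1 is the name, group 2 is the whole
# run of " <param>" items, which is split up afterwards.
COMMAND = re.compile(r"^(\w+)((?:[ \t]+<\w+>)+)", re.MULTILINE)


def parse_commands(text):
    results = []
    for m in COMMAND.finditer(text):
        # Split group 2 on whitespace, then drop the surrounding brackets.
        params = [p.strip("<>") for p in m.group(2).split()]
        results.append({"groups": [m.group(1), *params], "match": m.group(0)})
    return results


sample = "test <1> <example> <dynamic>\nrandom words here\nmove <left>"
for entry in parse_commands(sample):
    print(entry)
# {'groups': ['test', '1', 'example', 'dynamic'], 'match': 'test <1> <example> <dynamic>'}
# {'groups': ['move', 'left'], 'match': 'move <left>'}
```

The regex still yields only two groups per match; the variable-length list comes from the split in the host language, which is why lines without any `<...>` parameters are skipped and no empty groups appear.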