2

I need to come up with a regular expression in the PCRE flavor. It must be a regular expression <

I want to grab all lines of text that end in a newline character up until I encounter <zz> where zz is a digit enclosed in '<' and '>'.

e.g.

111a z
222 aset
333 //+
12 <zz> 11
abc
def

It would need to capture "111a z", "222 aset", "333 //+" in this case [and nothing else]. Right now I have ^(?!.*<zz>)[^\n]+(?=\n) but it's pretty far off from what it needs to be.

For clarification purposes, the regex I was using shows <zz>, but definitely looking for a digit enclosed in angle brackets.

Would really appreciate some help.

Edit

This is /really/ difficult for me, because at least one of the answers looks like it does the job. I'll try to mark one... Thank you, everyone.

aaaa
  • You say 'where zz is a digit enclosed in [angle brackets]', but your example has a literal `<zz>` in there - so which is it? Also, are you really saying that you want every complete line before the first line that contains `<zz>` or `<\d+>` (whichever is the case)? – Grismar Jun 05 '20 at 04:01
  • You basically need the first match from this: `"^((?!<\\d+>).)*$"` with `PCRE_DOTALL` and `PCRE_MULTILINE`. But it's been answered before, I'll vote to close with link. – Grismar Jun 05 '20 at 04:08
  • Does this answer your question? [Regular expression to match a line that doesn't contain a word](https://stackoverflow.com/questions/406230/regular-expression-to-match-a-line-that-doesnt-contain-a-word) – Grismar Jun 05 '20 at 04:08
  • Which is it - digit enclosed in angle brackets. And yes. – aaaa Jun 05 '20 at 04:08
  • Haven't spent enough time to distill information from the topics you linked, but at a high level it's probably similar. – aaaa Jun 05 '20 at 16:31

3 Answers

2

You could repeatedly match whole lines, each ending in a Unicode newline sequence, as long as the <\d+> pattern does not occur in the line.

\A(?:(?!.*<\d+>).*\R)+

Explanation

  • \A Start of string
  • (?: Non capture group
    • (?!.*<\d+>) Negative lookahead, assert that the pattern <\d+> does not occur
    • .*\R Match any char except a newline 0+ times, then match a Unicode newline sequence
  • )+ Close the non capturing group, and repeat it 1+ times to match at least a single line
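
A minimal sketch of how this pattern could be exercised from Python. This is an illustration, not part of the original answer: Python's standard `re` module does not support `\R`, so `\r?\n` is substituted as an assumption, and the helper name `leading_lines` and the sample text (with `<5>` standing in for `<zz>`) are made up for the example.

```python
import re

# Assumption: \r?\n stands in for \R, which Python's standard re module lacks.
LEADING_LINES = re.compile(r"\A(?:(?!.*<\d+>).*\r?\n)+")


def leading_lines(text: str) -> list[str]:
    """Return the newline-terminated lines at the start of text that contain no <digits>."""
    match = LEADING_LINES.search(text)
    return match.group(0).splitlines() if match else []


sample = "111a z\n222 aset\n333 //+\n12 <5> 11\nabc\ndef\n"
print(leading_lines(sample))  # ['111a z', '222 aset', '333 //+']
```

Because the match is anchored with `\A`, it can only start at the very beginning of the string, so a text whose first line already contains `<\d+>` yields no match at all.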



If the <\d+> has to be present, you could assert that with a positive lookahead at the end

\A(?:(?!.*<\d+>).*\R)+(?=.*<\d+>)
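
A small sketch of the difference the trailing positive lookahead makes, again under the assumption that `\r?\n` replaces `\R` (not available in Python's standard `re`). The function name `lines_before_marker` and both sample strings are hypothetical.

```python
import re

# Assumption: \r?\n stands in for \R. The final lookahead requires that the
# line following the matched block contains <digits>.
REQUIRE_MARKER = re.compile(r"\A(?:(?!.*<\d+>).*\r?\n)+(?=.*<\d+>)")


def lines_before_marker(text: str) -> list[str]:
    """Return the leading lines only when a <digits> line actually follows them."""
    match = REQUIRE_MARKER.search(text)
    return match.group(0).splitlines() if match else []


print(lines_before_marker("111a z\n222 aset\n12 <5> 11\nabc\n"))  # ['111a z', '222 aset']
print(lines_before_marker("111a z\n222 aset\nabc\n"))             # []
```

Without the lookahead, the second call would return all three lines, since nothing would require the marker to be present.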
The fourth bird
  • This is actually more in line with what I described than one of the other answers here. Sadly, I didn't do an adequate job describing what I needed. Was an oversight on my part. Oddly, an answer that failed to address one of the requirements in full works for my use case better. – aaaa Jun 05 '20 at 14:40
  • Charlie Armstrong's answer actually worked for my use case better, but only because I failed to describe the problem adequately. Your answer along with one other were probably better. I don't really have enough knowledge about which is better in terms of performance, so I just went ahead and marked one that seemed to match. – aaaa Jun 05 '20 at 14:53
  • @aaaa Sure, can you add the example data that you have to a https://regex101.com/ link, save and share the link here in the comments and specify what should and should not be matched. – The fourth bird Jun 08 '20 at 18:48
  • https://regex101.com/r/2hO0Tn/1 - So, this is almost what I want. I just don't want it to fail on the first instance of seeing a digit enclosed in angle brackets. Would you perhaps know how to tweak this one? It's tantalizingly close. – aaaa Jun 08 '20 at 18:57
  • @aaaa So the logic right now of the pattern is: match the whole line followed by a newline if what is on the right contains `<\d+>`. And that `what is on the right` part can be on any line as `[\s\S]` matches any char including a newline. It does not match the last line, as that line actually contains `<00>` What should the expected match be? – The fourth bird Jun 08 '20 at 19:08
  • In the very first line where it says, `000 <00date=0000-00-00,time=00:00:00,devname="",devid="",logid="",type="",subtype="",level="",vd="",eventtime=,srcip=000.00`, if that were to say, " `000 <00>date=...` I would want that line matched since it's only the first instance of it. Where Cary Swoveland says, "I have assumed that the text may have more than one line that contains one or digits bracketed in '<' and '>', and that those lines are not themselves to be matched." is correct. Those lines are not to be matched except for the first instance of it. – aaaa Jun 08 '20 at 19:15
  • @aaaa If this is just about the first line, you could optionally match it https://regex101.com/r/Hxg3oN/1 – The fourth bird Jun 08 '20 at 19:33
  • I'll provide another example: https://regex101.com/r/4DJToA/2 -- In this case, `zzadsgas` is not correct. It should have stopped matching things after `<1>`, but it should have grabbed the first line as if it were "`000 <0date=0000-00-00,time=00:00:00,devname="",devid="",logid="",type="",subtype="",level="",vd="",eventtime=,srcip=000.00.00.000,srcp`". – aaaa Jun 08 '20 at 19:35
  • With the current logic, it will match `zzadsgas`. Why should it not match it? That is exactly what the positive lookahead is doing: matching the whole line because `<\d+>` occurs somewhere to the right. Should `<\d+>` occur only once and then stop the match? – The fourth bird Jun 08 '20 at 19:41
  • At a high level, I'm trying to simply get lines from the first "event" only. An event starts with a digit enclosed in angle brackets and ends with a newline. The problem is that some events have newline characters in the middle of them. I'm trying to come up with a regular expression to tackle this problem. https://regex101.com/r/21ldPF/1 - This shows why it's failing. It grabbed the second event. – aaaa Jun 08 '20 at 19:50
  • To get the first one only, you could omit the global flag and grab all the following lines that do not contain the digit pattern.`^.*<\d+>.*(?:\r?\n(?!.*<\d+>).*)*` https://regex101.com/r/ACH0q4/1 – The fourth bird Jun 08 '20 at 20:15
  • I should explain. I'm working with https://www.elastic.co/guide/en/logstash/current/plugins-codecs-multiline.html. I sadly cannot just disable the global flag that I'm aware of. My use case probably demands that. – aaaa Jun 08 '20 at 20:27
  • Is there any way that you know of that I could return the first match without omitting that flag or perhaps something analogous? I thought maybe `\1` might work, but cannot figure out how to integrate it. – aaaa Jun 09 '20 at 16:01
  • I have no knowledge of Logstash, unfortunately. You could give this a try: `\A.*<\d+>.*(?:\r?\n(?!.*<\d+>).*)*` I read the readme here; is it not supposed to match multiple values? https://github.com/logstash-plugins/logstash-codec-multiline/blob/master/docs/index.asciidoc – The fourth bird Jun 09 '20 at 16:16
  • I wasn't really expecting you to [or even distill information from it for that matter], I just was placing the link there to explain the higher level problem that I'm trying to solve. I think it is supposed to match multiple. I think that it's global by default with the plugin I'm using. Though, there may be a way around that. I'm just trying everything I can to solve this problem. I will attempt the expression you sent in a second. – aaaa Jun 09 '20 at 19:11
  • That may have actually solved the problem. I can't even begin to thank you enough if that was it. I need to do more rigorous testing, but could you explain the small change that you made? – aaaa Jun 09 '20 at 19:16
  • I used `\A` instead of `^` See this page for more info about the anchors https://www.rexegg.com/regex-anchors.html#A – The fourth bird Jun 09 '20 at 19:40
  • Thank you, will check it out. – aaaa Jun 09 '20 at 20:11
  • I just want to say thank you for everything you've helped with. I haven't found any problems with what you provided and have learned a lot from this whole process. – aaaa Jun 09 '20 at 23:12
  • Could we go to chat briefly? I sadly have one more restriction that I was unaware of at the time. I've tried fixing it myself, but still struggling. It turns out the "events" are separated by escaped newline characters. I can share what I've come up with so far if it would help. @The fourth bird – aaaa Jun 11 '20 at 19:12
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/215772/discussion-between-aaaa-and-the-fourth-bird). – aaaa Jun 11 '20 at 19:16
0

I'm not sure why you're using a negative lookahead, but I think you want a positive lookahead. This lets you only match the line if you see the <zz> in a lookahead. I would solve the problem using something like this:

^.*(?=.*(?:\n.*)*<\d+>)\n
  • ^ Anchors match to beginning of line (like yours)
  • .* Matches all the characters it can. In this case it matches the whole line because it has to satisfy the \n at the end.
  • (?=...) Performs a positive lookahead (makes sure the string exists somewhere ahead)
  • .*(?:\n.*)* Allows any number of characters on any number of lines
  • <\d+> Only matches one or more digits enclosed in angle brackets
  • \n ensures that there is a newline at the end of the line.
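
As a rough illustration of how this pattern behaves, here is a sketch using Python's `re` with the `MULTILINE` flag so that `^` anchors at each line start. The helper name `lines_with_marker_ahead` and the sample strings are assumptions made for the example, with `<5>` standing in for `<zz>`.

```python
import re

# MULTILINE lets ^ match at the start of every line, not just the string.
MARKER_AHEAD = re.compile(r"^.*(?=.*(?:\n.*)*<\d+>)\n", re.MULTILINE)


def lines_with_marker_ahead(text: str) -> list[str]:
    """Return every line that has a <digits> marker on some later line."""
    return [line.rstrip("\n") for line in MARKER_AHEAD.findall(text)]


sample = "111a z\n222 aset\n333 //+\n12 <5> 11\nabc\ndef\n"
print(lines_with_marker_ahead(sample))  # ['111a z', '222 aset', '333 //+']

# With more than one marker, an earlier marker line is itself matched,
# because a later marker still lies ahead of it.
print(lines_with_marker_ahead("<1> a\nb\n<2> c\nd\n"))  # ['<1> a', 'b']
```

The second call shows that the pattern never tests whether the current line contains the marker; it only checks that one occurs further on.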
Charlie Armstrong
  • Hey, Charlie Armstrong. Oddly your answer actually worked for my use case better. I think the answers from Cary Swoveland and The fourth bird were more in line with how I described the problem, but this was invaluable to me nonetheless. – aaaa Jun 05 '20 at 14:42
0

I have assumed that the text may have more than one line that contains one or more digits bracketed in '<' and '>', and that those lines are not themselves to be matched.

You can use the following expression to match the lines of interest.

^(?!.*<\d+>).*\r?\n(?=[\s\S]*?<\d+>)


The regex engine performs the following operations.

^           match beginning of line
(?!         begin negative lookahead (prevent matching a line with '<12>')
  .*        match 0+ characters other than newlines
   <\d+>    match '<', 1+ digits, '>'
)           end negative lookahead
.*          match 0+ characters other than newlines
\r?\n       match newline optionally preceded by '\r'
(?=         begin positive lookahead
  [\s\S]*?  match 0+ characters (incl. newlines), non-greedily
  <\d+>     match '<', 1+ digits, '>' 
)           end positive lookahead

'\r', a carriage return, will be present if the file was produced when using the Windows operating system.
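
The following sketch tries the expression with Python's `re` module, assuming the `MULTILINE` flag so that `^` matches at each line start. The function name `unmarked_lines_before_marker` and the sample strings are invented for illustration, with `<5>` standing in for `<zz>`.

```python
import re

# MULTILINE makes ^ anchor at every line start.
UNMARKED = re.compile(r"^(?!.*<\d+>).*\r?\n(?=[\s\S]*?<\d+>)", re.MULTILINE)


def unmarked_lines_before_marker(text: str) -> list[str]:
    """Return lines without <digits> that are followed, somewhere later, by <digits>."""
    return [line.rstrip("\r\n") for line in UNMARKED.findall(text)]


sample = "111a z\n222 aset\n333 //+\n12 <5> 11\nabc\ndef\n"
print(unmarked_lines_before_marker(sample))  # ['111a z', '222 aset', '333 //+']

# Lines that contain a marker are skipped, and so are lines after the last marker.
print(unmarked_lines_before_marker("<1> a\nb\n<2> c\nd\n"))  # ['b']
```

The second call reflects the stated assumption: marker lines are never matched themselves, while unmarked lines between two markers are.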

Cary Swoveland
  • This is actually more in line with what I described than one of the other answers here. Sadly, I didn't do an adequate job describing what I needed. Was an oversight on my part. Oddly, an answer that failed to address one of the requirements in full works for my use case better. – aaaa Jun 05 '20 at 14:39
  • If your question was not clear you should edit to make it clear, even now that you’ve selected an answer. You owe that to current and future readers. One thing specifically is that you need to fix your example to address @Grismar’s comment. That leaps out for everybody who reads your question. – Cary Swoveland Jun 05 '20 at 16:04
  • The reason I didn't edit it was because you all did answer it as it was posed and correctly. I didn't accept Charlie Armstrong's answer as it didn't cover the case where it started with a digit enclosed in angle brackets. I will however address Grismar's comment; I agree with you, improves readability. – aaaa Jun 05 '20 at 16:24
  • Hey, Struggling a bit. regex101.com/r/2hO0Tn/1 - So, this is almost what I want. I just don't want it to fail on the first instance of seeing a digit enclosed in angle brackets. Would you perhaps know how to tweak this one? It's tantalizingly close – aaaa Jun 08 '20 at 19:00