
What I want to do:

Using Java, I want to match a RegEx pattern, unless the match is immediately followed by a "poison" suffix.

Examples:

 "legitString" RETURNS "legitString"

 "legitString blabla" RETURNS "legitString"

 "legitString PoisonousSuffix" RETURNS "legitString"

 "legitStringPoisonousSuffix" RETURNS no match

My use case:

I need to parse as many references from a file as I can, following a particular pattern. But some lines of the file are truncated, and not always at the same length(!).

Luckily, when this happens, the line ends with ">>". In that case I have to assume the reference is truncated and discard it. So ">>$" would be the poisonous suffix in my case. On the other hand, if ">>" is in the middle of the text, I should safely extract the reference as I normally would. (The reference ends with digits, but the number of digits varies, so I can't rely on that.)

So in my case:

"REF" RETURNS "REF"

"REF >>" RETURNS "REF"

"REF>>" RETURNS nothing

"REF>> bla " RETURNS "REF" // because in my case, the poison is only poisonous at the end of the line

I've looked at https://stackoverflow.com/tags/regex/info and tried the syntax

myRegex(?!>>$)

and it looks wrong. It truncates the last legit digit of the reference when the line ends with ">>", which is the worst scenario: a corrupted reference going through.
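
The truncation described above can be reproduced with a small sketch. The class and method names (`NaiveLookaheadDemo`, `firstMatch`) are hypothetical and only there for illustration. When the lookahead fails right after the full reference, the greedy `\d*` gives back one digit, and at that shorter position the text ahead is `1>>` rather than `>>`, so the negative lookahead passes:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NaiveLookaheadDemo {
    // The question's reference pattern with the naive negative lookahead appended.
    static final Pattern NAIVE = Pattern.compile(
            "\\b(SWN-)?WZ-SB\\d{2}(-\\d{2}){2}-[A-Z]?\\d*(?!>>$)");

    // Returns the first match in the line, or null when nothing matches.
    static String firstMatch(String line) {
        Matcher m = NAIVE.matcher(line);
        return m.find() ? m.group() : null;
    }
}
```

With this sketch, `firstMatch("SWN-WZ-SB00-49-03-C11>>")` yields the corrupted `SWN-WZ-SB00-49-03-C1` instead of no match.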

I've also seen "Regex for string not ending with given suffix", but:

myRegex(?:(?!>>).).$

rejects legitimate references.

My exact regex (without poison):

   \b(SWN-)?WZ-SB\d{2}(-\d{2}){2}-[A-Z]?\d* 

should return SWN-WZ-SB00-49-03-C11 for:

"SWN-WZ-SB00-49-03-C11>> bla"

"SWN-WZ-SB00-49-03-C11 >>  "

"SWN-WZ-SB00-49-03-C11 >>"

"SWN-WZ-SB00-49-03-C11 >> bla"

and nothing for:

"SWN-WZ-SB00-49-03-C11>>"

Bonus

Is there a way to generalize this and have a function that takes a regexPattern and a poisonousSuffix and returns a safeRegexPattern?

Thanks

Akita
  • The first one and the last one confuse me. Both legit strings end with a ">>". It's a little hard to see how the first is correct and the second is wrong. Is that possibly a mistake? Basically it seems like both end with the "poisonous suffix." – markspace Jul 05 '18 at 16:07
  • If I understand correctly, I think I would just read each line and discard any that end with the poisonous suffix, ">>". Then parse as normal. This seems easier than trying to cram everything into one regex. It might also be easier on your maintainers than reading an overly complicated regex. – markspace Jul 05 '18 at 16:12
  • @markspace In my use case, my poisonous suffix is not ">>", but ">> + EndOfLine", which I believe to be ">>$" in Regex. So the end of line is just relevant in my specific case. I will edit the question to make it clearer. – Akita Jul 05 '18 at 16:12
  • That's why I said "read each line." – markspace Jul 05 '18 at 16:13
  • @Akita Your requirements are out of sync with the examples. If your suffix is `>>$`, then `SWN-WZ-SB00-49-03-C11>> bla` should result in `SWN-WZ-SB00-49-03-C11>> bla` – Wiktor Stribiżew Jul 05 '18 at 16:27
  • @WiktorStribiżew No, because even if the poison is not active in this case, the rest of the regex should kick in and keep only the reference. – Akita Jul 05 '18 at 16:36
  • @markspace A dedicated "pattern of pattern" would be useful for turning this into a function. But if I don't manage to implement the poison idea or find a useful library, I will do this of course. – Akita Jul 05 '18 at 16:42
  • @Akita after looking at your updated requirements I have updated my answer to account for your suffix. – emsimpson92 Jul 05 '18 at 16:50
  • Does the line continuation/truncate symbol only exist after a complete REF or can it appear practically everywhere? – wp78de Jul 05 '18 at 17:09
  • Depending on the file/string size, you could remove all >> and treat the string as a single line. – wp78de Jul 05 '18 at 17:20

1 Answer

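One way to do this is with a regex conditional. Note that conditionals are supported by engines such as PCRE and .NET but not by Java's java.util.regex, so this pattern will not compile in Java as written. Here is the pattern I used.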

(?(?![\w-]+>>$)(?:([^\s>]*)(?:.*))|([^\w\W]))

I'll provide a breakdown for you:

(?(...)...|...) is an if-then-else conditional

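(?![\w-]+>>$) is the condition: a negative lookahead that succeeds only when the string is not a run of word characters or hyphens followed by >> at the end, i.e. when the poisonous suffix is absent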

(?:([^\s>]*)(?:.*)) captures everything up to the first whitespace character or >, then consumes the rest of the line without capturing it

| OR

([^\w\W]) can never match a character, so this branch matches nothing.

So the syntax for an if conditional is (?(condition)then|else). What this pattern does: if the string does not end with the suffix, it returns the text up to the first whitespace character or >; if it does, it matches nothing.
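
Since java.util.regex has no conditional construct, the same intent can be sketched in Java with an atomic group followed by a negative lookahead. This swaps in a different technique from the conditional above. The class and method names (`PoisonSuffix`, `safePattern`, `extract`) are hypothetical. The helper assumes the poisonous suffix is itself given as a regex (for example `>>$`) and that the base pattern is a valid regex on its own:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PoisonSuffix {
    // Hypothetical helper: wraps regexPattern in an atomic group so the engine
    // cannot give back characters once the group has matched, then rejects the
    // match if poisonousSuffix (itself a regex, e.g. ">>$") starts right after it.
    static Pattern safePattern(String regexPattern, String poisonousSuffix) {
        return Pattern.compile("(?>" + regexPattern + ")(?!" + poisonousSuffix + ")");
    }

    // The reference pattern from the question, guarded against ">>" at end of line.
    static final Pattern SAFE_REF = safePattern(
            "\\b(SWN-)?WZ-SB\\d{2}(-\\d{2}){2}-[A-Z]?\\d*", ">>$");

    // Returns the first safe reference in the line, or null if there is none.
    static String extract(String line) {
        Matcher m = SAFE_REF.matcher(line);
        return m.find() ? m.group() : null;
    }
}
```

The atomic group only stops the engine from shortening a match at a given start position. `find()` can still try later start positions, so other base patterns may need an extra anchor or boundary at the front. This sketch also assumes one line per input string. For multi-line input, `Pattern.MULTILINE` would likely be needed so that `$` matches at each line end.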

Demo

emsimpson92