0

I'm trying to match everything in a comma-delimited string that is not a Change Notice (CN) id, which is an alphanumeric id starting with "CN". The string is a list of items separated by commas, where each item entry shows the item id followed by "~-~" and some jargon.



Here's an example string:

CN98765432~-~ECN for A01234 Rev A,CR00098765~-~ECR for A12345 SOME PART NAME,CN12345678~-~ECN for A12345 Rev A

In this string, I want to match everything except "CN98765432" (which appears at the beginning) and "CN12345678" (which appears after the last comma).


I've tried using .*(?=CN\d), which I assumed would grab everything that ends before a "CN", but that incorrectly matched

CN98765432~-~ECN for A01234 Rev A,CR00098765~-~ECR for A12345 SOME PART NAME,

which includes the initial CN.

I also tried .*((?=CN\d)|$), but that matched the entire string.



I've looked at similar problems but I was not able to adapt the answers into something suitable for my issue.

How to match "anything up until this sequence of characters" in a regular expression?

Regex everything but

How do I match everything except for the CN IDs?



I'm using the regex inside a Java-based piece of software, so I believe this is a JavaScript-flavored regex.

Some One
  • if it's comma separated, maybe it's easier to just split it first? – Nyerguds Apr 27 '18 at 21:35
  • I can split it and then filter out what I don't want, but it's less efficient than deleting regex matches, so I'd like to see if using a regex is possible. – Some One Apr 27 '18 at 23:09

2 Answers

1

For your example string, you could try it like this to match everything except "CN98765432" and "CN12345678". As you state in your comment on karakfa's answer:

"Ideally I would want CN98765432,CN12345678 to be all that's left"

,?(?!CN\d+)\b[\w~ -]+

That would match

  • ,? Match optional comma
  • (?! Negative lookahead that asserts what is on the right side is not
    • CN\d+ Match CN followed by one or more digits
  • ) Close negative lookahead
  • \b Word boundary
  • [\w~ -]+ Character class repeated one or more times with the characters you allow to match
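As a quick check of the pattern above, here is a minimal sketch that deletes every match and keeps what is left. It uses Python's `re` module purely for illustration, which is an assumption: the engine inside the actual software may behave differently (in Java the counterpart would presumably be `String.replaceAll`). The variable names are made up for the example.

```python
import re

# Example string from the question.
s = ("CN98765432~-~ECN for A01234 Rev A,"
     "CR00098765~-~ECR for A12345 SOME PART NAME,"
     "CN12345678~-~ECN for A12345 Rev A")

# Pattern from this answer: optional comma, not followed by a CN id,
# a word boundary, then one or more of the allowed characters.
pattern = re.compile(r",?(?!CN\d+)\b[\w~ -]+")

# Delete every match; only the CN ids and the comma between them remain.
remaining = pattern.sub("", s)
print(remaining)  # CN98765432,CN12345678
```

Note that the character class only covers the characters seen in the example string; jargon containing other punctuation would need those characters added to it.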
The fourth bird
  • Thank you! Can you explain why `.*(?=CN\d)` didn't work? – Some One Apr 28 '18 at 22:34
  • @SomeOne [`.*(?=CN\d)`](https://regex101.com/r/Wgdl8R/1) does not select all that is not a Change Notice (CN) because `.*` selects any character zero or more times ([greedy](https://www.regular-expressions.info/repeat.html)) until you encounter `CN` followed by a digit so it would match all until it encounters the last `CN\d`. – The fourth bird Apr 29 '18 at 08:05
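The greedy behaviour described in the comment above can be reproduced with a short sketch. Python's `re` module is used here only as an illustration (an assumption; the engine in the asker's software may differ), and the variable names are hypothetical.

```python
import re

# Example string from the question.
s = ("CN98765432~-~ECN for A01234 Rev A,"
     "CR00098765~-~ECR for A12345 SOME PART NAME,"
     "CN12345678~-~ECN for A12345 Rev A")

# ".*" first consumes the whole string, then backtracks only as far as
# needed for the lookahead to succeed, i.e. to just before the LAST "CN<digit>".
first_attempt = re.search(r".*(?=CN\d)", s).group()

# With "|$" added, ".*" never has to backtrack: the end of the string
# already satisfies the second alternative.
second_attempt = re.search(r".*((?=CN\d)|$)", s).group()

print(first_attempt)        # includes the leading CN98765432
print(second_attempt == s)  # True: the entire string is matched
```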
0

It might be easier to just delete the matched pattern:

$ sed -E 's/CN[0-9]+//g' file

~-~ECN for A01234 Rev A,CR00098765~-~ECR for A12345 SOME PART NAME,~-~ECN for A12345 Rev A

If you want to capture the patterns instead:

$ grep -oP 'CN[0-9]+' file | paste -sd,

CN98765432,CN12345678
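For anyone without `sed` and `grep` at hand, the same two steps can be sketched in Python. This is only a rough counterpart of the shell commands above, not something the asker's software is known to support, and the variable names are invented for the example.

```python
import re

# Example string from the question (the shell commands read it from "file").
s = ("CN98765432~-~ECN for A01234 Rev A,"
     "CR00098765~-~ECR for A12345 SOME PART NAME,"
     "CN12345678~-~ECN for A12345 Rev A")

# Counterpart of: sed -E 's/CN[0-9]+//g'
# Removes every CN id; "ECN" is untouched because no digit follows its "CN".
without_ids = re.sub(r"CN[0-9]+", "", s)

# Counterpart of: grep -oP 'CN[0-9]+' | paste -sd,
# Collects every CN id and joins them with commas.
ids = ",".join(re.findall(r"CN[0-9]+", s))

print(without_ids)
print(ids)  # CN98765432,CN12345678
```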
karakfa
  • I'm trying to keep the CN's by deleting everything else. The software I'm using cannot "grab" matches, it can only delete them. Ideally I would want `CN98765432,CN12345678` to be all that's left. – Some One Apr 27 '18 at 21:18