Find matches ending with a letter that is not a starting letter of the next match

Question

Intro

I have a string containing ICD-10 diagnosis codes that are not separated by any character. I would like to extract all valid diagnosis codes. A valid diagnosis code has the form

[Letter][between 2 and 4 digits][optional letter that is not the starting letter of the next match]

The regex for this pattern is (I believe)

\w\d{2,4}\w?

Example

Here is an example

mystring='F328AG560F33'

In this example there are three codes:

'F328A' 'G560' 'F33'

I would like to extract these codes with a function like str_extract_all in R (preferably, but not exclusively).

My solution so far

So far, I managed to come up with an expression like:

str_extract_all(mystring,pattern='\\w\\d{2,4}\\w?(?!(\\w\\d{2,4}\\w?))')

However when applied to the example above it returns

"F328"  "G560F"

That is, it drops the letter A from the first code and misses the last code "F33" altogether, because its F is wrongly assigned to the preceding code.

Question

What am I doing wrong? I only want to extract values that end with a letter that is not the start of the next match, and if it is, the match should not include the letter.

Application

This question is of great relevance for example when mining patient Electronic Health Records that have not been validated.

Please update your answer and use the built in formatting tools to format your code examples. — Soviut, Oct 30 '17 at 16:14
Can you explain why you've written your regex the way you do? — Soviut, Oct 30 '17 at 16:22
As I wrote in the question, I wrote the regex that way because I thought I needed to "tell the regex engine to match the longest possible pattern that is not followed by a valid pattern". As both Xophmeister and Wiktor pointed out, this was not the right approach. — Gino_JrDataScientist, Oct 31 '17 at 10:11
Ah, yes! I had tried earlier but couldn't bc of low rep. You are a keen observer :) If not too much bother would you mind explaining one thing: in your suggested regex you wrap the optional letter in a non-capturing group. So the optional letter should really not be captured. And still it does! What gives? — Gino_JrDataScientist, Nov 02 '17 at 09:04
@Gino_JrDataScientist A non-capturing group only prevents from creating a memory buffer for the part of the match captured with the group pattern(s). However, these patterns are still *consuming*, i.e. the chars they match are added to the match/capture. In the pattern below, there is an outer capturing group inside the lookahead. That means that `(a(?:bc)?)` will still capture `abc` if there is `abc` in the input string, but there will be no second capture group for `bc` in the match data object. — Wiktor Stribiżew, Nov 03 '17 at 00:29
I see. Basically I didn't know the difference between capturing and consuming. Guess I have to finally start learning the basics :) Thank you for the clear and pedagogic explanation! — Gino_JrDataScientist, Nov 03 '17 at 12:43
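The difference between consuming and capturing described in the comment above can be checked directly in R. This is only a minimal sketch, assuming the stringr package; the input "abc" and the variable names m_nc and m_c are chosen here for illustration.

```r
library(stringr)

# Outer capturing group, inner optional group is NON-capturing
m_nc <- str_match("abc", "(a(?:bc)?)")
# Same pattern, but the inner optional group is capturing
m_c <- str_match("abc", "(a(bc)?)")

# "bc" is consumed in both cases, so it is part of the whole match
# (column 1) and of Group 1 (column 2).
m_nc[1, 1]  # "abc"
m_nc[1, 2]  # "abc"

# The only difference is whether "bc" also gets a column of its own.
ncol(m_nc)  # 2: whole match + Group 1
ncol(m_c)   # 3: whole match + Group 1 + Group 2
m_c[1, 3]   # "bc"
```

So a non-capturing group changes what is stored in the match data, not what the pattern matches.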

score 2 · Answer 1 · answered Oct 30 '17 at 16:43


You have a letter, two to four digits, then an optional letter. That optional letter, if it's there, will only ever be followed by another letter; or, put another way, it is never followed by a digit. You can write a negative lookahead to express this:

\w\d{2,4}(?:\w(?!\d))?

This at least works with PCRE. I don't know how R will handle it.
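For what it is worth, the same pattern can be tried in R. The following is a minimal sketch, assuming the stringr package (which uses the ICU regex engine rather than PCRE) and the example string from the question. Backslashes are doubled because the pattern is written as an R string literal; the variable names are illustrative.

```r
library(stringr)

mystring <- "F328AG560F33"
# The optional trailing \w is only taken when it is NOT followed by a digit
pattern <- "\\w\\d{2,4}(?:\\w(?!\\d))?"

codes <- str_extract_all(mystring, pattern)[[1]]
codes
# [1] "F328A" "G560"  "F33"

# Base R alternative that uses the PCRE engine via perl = TRUE
codes_base <- regmatches(mystring, gregexpr(pattern, mystring, perl = TRUE))[[1]]
codes_base
# [1] "F328A" "G560"  "F33"
```

Note that \w also matches digits and underscores, so this pattern does not insist that a code starts with a letter.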


Xophmeister


For some reason this regex also returns codes starting with a number. For example, with '328AG560F33' it returns '328A' 'G560' 'F33'. I know I did not specify that there can be invalid patterns, so this answer is valid for the question I posed. However, Wiktor's regex seems to handle broken patterns like this one well. Thank you for your very good answer in any case! :) – Gino_JrDataScientist Oct 31 '17 at 09:56
@Gino_JrDataScientist No, do not use `[A-z]`. [It matches more than letters.](https://stackoverflow.com/questions/29771901/why-is-this-regex-allowing-a-caret/29771926#29771926) – Wiktor Stribiżew Oct 31 '17 at 12:17
**EDIT** it's my fault: in my question I said that the regex for a valid pattern is with \w whereas it should have been [A-z] – Gino_JrDataScientist Oct 31 '17 at 12:27
Thanks @WiktorStribiżew!! I really did not know about it – Gino_JrDataScientist Oct 31 '17 at 12:28
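The warning about [A-z] is easy to verify. Here is a minimal sketch, assuming the stringr package, whose ICU engine resolves ranges by code point: A-z then spans everything from A (65) to z (122), which includes the six punctuation characters that sit between Z and a in ASCII. The variable names are illustrative.

```r
library(stringr)

# The six ASCII characters between "Z" and "a": [ \ ] ^ _ `
between <- c("[", "\\", "]", "^", "_", "`")

# [A-z] accepts every one of them
loose <- str_detect(between, "^[A-z]$")
# [A-Za-z] accepts letters only, so it rejects all of them
strict <- str_detect(between, "^[A-Za-z]$")

all(loose)   # TRUE
any(strict)  # FALSE
```

Using [A-Z] together with the case insensitive modifier (?i), or spelling out [A-Za-z], avoids the problem.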

Wiktor Stribiżew · Accepted Answer · 2017-10-31T12:36:03.733


Your matches are overlapping. In this case, you can use str_match_all, which allows easy access to capturing groups, with a pattern that consists of a positive lookahead containing a capturing group:

(?i)(?=([A-Z]\d{2,4}(?:[A-Z](?!\d{2,4}))?))

See the regex demo

Details

- (?= - a positive lookahead start (it is run at every location: before each char and at the end of the string)
  - ( - Group 1 start
    - [A-Z] - a letter (with the case insensitive modifier (?i), it matches letters of either case)
    - \d{2,4} - 2 to 4 digits
    - (?: - an optional non-capturing group start:
      - [A-Z] - a letter
      - (?!\d{2,4}) - not followed by 2 to 4 digits
    - )? - the optional non-capturing group end
  - ) - Group 1 end
- ) - Lookahead end.

R demo:

> library(stringr)
> res <- str_match_all("F328AG560F33", "(?i)(?=([A-Z]\\d{2,4}(?:[A-Z](?!\\d{2,4}))?))")
> res[[1]][,2]
[1] "F328A" "G560"  "F33"
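To see how the pattern treats the boundary cases raised in the comments, the demo can be wrapped in a small helper. This is only a sketch: extract_codes is a hypothetical name introduced here, not a stringr function, and the inputs are taken from or modelled on the comment thread.

```r
library(stringr)

# Hypothetical helper: return Group 1 of every lookahead match
extract_codes <- function(x) {
  pattern <- "(?i)(?=([A-Z]\\d{2,4}(?:[A-Z](?!\\d{2,4}))?))"
  str_match_all(x, pattern)[[1]][, 2]
}

# The F starts the valid code F33, so G560 keeps no trailing letter
extract_codes("F328AG560F33")
# [1] "F328A" "G560"  "F33"

# F3 is not a valid code, so the F stays with G560
extract_codes("F328AG560F3")
# [1] "F328A" "G560F"

# The leading digits and the stray A do not start a valid code and are skipped
extract_codes("328AG560F33")
# [1] "G560" "F33"
```

Because the lookahead itself consumes nothing, the engine retries at every position, and only positions where a letter is followed by 2 to 4 digits produce a Group 1 value.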

edited Oct 31 '17 at 12:36

answered Oct 30 '17 at 16:36


Thank you very much! This regex works for the task that I described. I chose this answer because it seems also resilient to ill-specified diagnosis codes and "noise" like punctuation. For example with the string 'F33%10CB1203SA12 F2¤1' it will return "F33" "B1203S" "A12" which is precisely what I want. – Gino_JrDataScientist Oct 31 '17 at 10:06
I just don't quite understand why it works. I know you broke your regex down, and thanks for that! I guess I just need to understand this lookahead business better. Thanks again! – Gino_JrDataScientist Oct 31 '17 at 10:15
This answer is actually not completely right: in the example I gave, the expected result was "F328A" "G560" "F33", while your demo gives "F328A" "G560F" "F33". The problem is that the second code has an extra F at the end. Any idea how to fix this? – Gino_JrDataScientist Oct 31 '17 at 12:08
@Gino_JrDataScientist Where is the `F` gone then? It is optional in the pattern and it is there inside the string. – Wiktor Stribiżew Oct 31 '17 at 12:10
@Gino_JrDataScientist Ok, I will fix it in a minute or two if it is fixable. – Wiktor Stribiżew Oct 31 '17 at 12:11
yes the F is in the pattern but it belongs to the following code because "F33" is a valid code. If the string had been "F328AG560F3" then the expected result should be "F328A" "G560F" because F3 is not a valid code – Gino_JrDataScientist Oct 31 '17 at 12:19
I see, please next time explain the match boundaries more precisely, it makes a huge difference for regexps since you need to also account for *context*, not just the type of chars that may or may not appear. – Wiktor Stribiżew Oct 31 '17 at 12:21
Will do. Sorry I am very new to this.. How would you have explained the match boundaries, may I ask? – Gino_JrDataScientist Oct 31 '17 at 12:29
@Gino_JrDataScientist, OK, found, I think it is [`(?=([A-Z]\\d{2,4}(?:[A-Z](?!\\d{2,4}))?))`](https://regex101.com/r/cId2gV/1) – Wiktor Stribiżew Oct 31 '17 at 12:30
@Gino_JrDataScientist: You spoke about longest/shortest matches, but that is irrelevant. What you need to specify is that you only want to extract values that must end with a letter that is not the start of the next match. – Wiktor Stribiżew Oct 31 '17 at 12:31
@Gino_JrDataScientist See my final update. It should work well for all inputs now. – Wiktor Stribiżew Oct 31 '17 at 12:36
