
I'm trying to extract the degree (Mild/Moderate/Severe) of a specific type of heart dysfunction (diastolic dysfunction) from a large number of echo reports.

Here is the link to the sample excel file with 2 of those echo reports.

The lines are usually expressed like this: "Mild LV diastolic dysfunction" or "Mild diastolic dysfunction". Here, "Mild" is what I want to extract.

I wrote the following pattern:

pattern <- regex("(\\b\\w+\\b)(?= (lv )?(d(i|y)astolic|distolic) d(y|i)sfunction)",
                               ignore_case = FALSE)

Now, let's look at the results (remember I want the "Mild" part not the "LV" part):

str_view_all(df$echo, pattern)

As you can see, in strings like "Mild diastolic dysfunction" the pattern correctly selects "Mild", but in "Mild LV diastolic dysfunction" it selects "LV", even though I put the optional (lv )? group inside the positive lookahead (?=...).

Does anyone know what I am doing wrong?

Wiktor Stribiżew

2 Answers

The problem is that \w+ matches any one or more word chars, and the lookahead does not consume the chars it matches (the regex index remains where it was).

So, LV gets matched by \w+ because diastolic dysfunction comes right after it, and ( lv)? is an optional group: the lookahead still succeeds when there is no space+lv before diastolic dysfunction.
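A minimal sketch of this behaviour, assuming the stringr package is installed and using two made-up strings modelled on the question's examples (the names `x`, `p`, `case_sensitive` and `case_insensitive` are hypothetical). It runs the question's pattern both case-sensitively, as posted, and case-insensitively:

```r
library(stringr)

# Hypothetical sample strings modelled on the question's examples.
x <- c("Mild LV diastolic dysfunction", "Mild diastolic dysfunction")

# The question's original pattern.
p <- "(\\b\\w+\\b)(?= (lv )?(d(i|y)astolic|distolic) d(y|i)sfunction)"

# Case-sensitive (as posted): "lv " cannot match "LV ", so only the word
# directly before "diastolic dysfunction" satisfies the lookahead.
case_sensitive <- str_extract_all(x, regex(p, ignore_case = FALSE))

# Case-insensitive: "Mild" now matches in the first string as well, but "LV"
# is still matched, because the optional (lv )? group may match nothing.
case_insensitive <- str_extract_all(x, regex(p, ignore_case = TRUE))

print(case_sensitive)
print(case_insensitive)
```

In both runs "LV" appears among the matches for the first string, which is why the lookahead alone is not enough here.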

If you do not want to match LV, add a negative lookahead to restrict what \w+ matches:

\b(?!lv\b)\w+\b(?=(?:\s+lv)?\s+d(?:[iy]a|i)stolic d[yi]sfunction)

See the regex demo

Also, note that [iy] is a better way to write (i|y).

In R, you may define it as

pattern <- regex(
    "\\b(?!lv\\b)\\w+\\b(?=(?:\\s+lv)?\\s+d(?:[iy]a|i)stolic\\s+d[yi]sfunction)",
    ignore_case = TRUE  # required so that "lv" in the pattern also covers "LV"
)
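A quick check of this pattern, assuming stringr and two hypothetical sample strings (`x`, `fixed`, `degree_ci` and `degree_cs` are illustrative names). It also shows that the (?!lv\b) exclusion only covers upper-case "LV" when matching is case-insensitive, which is the issue that comes up in the comments:

```r
library(stringr)

# Hypothetical sample strings, including a misspelled variant.
x <- c("Mild LV diastolic dysfunction", "Moderate dyastolic disfunction")

fixed <- "\\b(?!lv\\b)\\w+\\b(?=(?:\\s+lv)?\\s+d(?:[iy]a|i)stolic\\s+d[yi]sfunction)"

# Case-insensitive: "LV" is excluded by the negative lookahead.
degree_ci <- str_extract(x, regex(fixed, ignore_case = TRUE))

# Case-sensitive: "LV" is not the same as "lv", so it slips through.
degree_cs <- str_extract(x, regex(fixed, ignore_case = FALSE))

print(degree_ci)
print(degree_cs)
```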
Wiktor Stribiżew
  • Thanks again Wiktor for the quick answers, but I'm a bit confused about why, in `(?= (lv )?(d[iy]astolic|distolic) d[yi]sfunction)`, the lookahead hides the `(d[iy]astolic|distolic) d[yi]sfunction` part from the str_view() results but not the `(lv )?` part. Both of them come inside the positive lookahead. – Kasra Mehdizadeh Sep 03 '21 at 09:41
  • @KasraMehdizadeh `( lv)?` is an *optional* group. So, the `\w+` also matches before both `lv distolic dysfunction` and `distolic dysfunction`. – Wiktor Stribiżew Sep 03 '21 at 09:45
  • Thanks again Wiktor, for being out there looking after rookies like me. – Kasra Mehdizadeh Sep 03 '21 at 09:49
  • sorry again, but with the pattern you wrote, str_view() still selects the "LV" parts – Kasra Mehdizadeh Sep 03 '21 at 10:17
  • @KasraMehdizadeh The regex does not match them. – Wiktor Stribiżew Sep 03 '21 at 10:19
  • [but it still does](https://drive.google.com/file/d/1Vc4VozUC5KOtDPtVniQ8P8BYCH_vrNlu/view?usp=sharing) – Kasra Mehdizadeh Sep 03 '21 at 10:27
  • @KasraMehdizadeh No, [it does **NOT**](https://imgur.com/a/fTxpozk) – Wiktor Stribiżew Sep 03 '21 at 10:28
  • I forgot to set the ignore_case = TRUE. this will fix the problem in str_view() **BUT** the problem persists when I want to use tidyr::extract() function. `extract(df, echo, c("diastolic dysfunction"), pattern)` – Kasra Mehdizadeh Sep 03 '21 at 12:41
  • @KasraMehdizadeh `extract(df, echo, c("diastolic dysfunction"), "(?i)\\b(?!lv\\b)(\\w+)\\b(?=(?:\\s+lv)?\\s+d(?:[iy]a|i)stolic\\s+d[yi]sfunction)")` works for me. – Wiktor Stribiżew Sep 03 '21 at 12:45
  • Now I understand what is wrong: apparently `regex()` arguments (such as `ignore_case = TRUE`) only work with functions from the `stringr` package and not with `tidyr::extract()`. So instead of `regex("pattern", ignore_case = TRUE)` I have to use `"(?i)pattern"`, just as you did. Thanks again, that's all. – Kasra Mehdizadeh Sep 03 '21 at 13:07
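The point from the last comments can be sketched as follows, assuming the tidyr package is installed. The data frame below is a made-up stand-in for the asker's `df`, and the output column name `degree` is hypothetical (the comment above uses "diastolic dysfunction"). The pattern is passed as a plain string with the inline `(?i)` flag, and the wanted word is wrapped in a capture group because `tidyr::extract()` fills its output column from capture groups:

```r
library(tidyr)

# Hypothetical stand-in for the asker's data frame.
df <- data.frame(
  echo = c("Mild LV diastolic dysfunction", "Severe diastolic dysfunction"),
  stringsAsFactors = FALSE
)

# Plain string pattern: (?i) switches on case-insensitive matching inline,
# and (\\w+) is the capture group that fills the new column.
pattern <- "(?i)\\b(?!lv\\b)(\\w+)\\b(?=(?:\\s+lv)?\\s+d(?:[iy]a|i)stolic\\s+d[yi]sfunction)"

out <- tidyr::extract(df, echo, into = "degree", regex = pattern)
print(out)
```

By default `extract()` drops the input column; passing `remove = FALSE` keeps `echo` alongside the new column.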

\w+ can also match LV, because the lv part in the lookahead is optional.

Instead of a lookahead, you can also use a capture group.

\b(?!lv)(\w+)\b (?:lv )?(?:d[iy]astolic|distolic) d[iy]sfunction

regex demo
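A minimal sketch of the capture-group variant in R, assuming stringr and hypothetical sample strings (`x`, `capture` and `degree` are illustrative names). `str_match()` returns a matrix whose first column is the full match and whose second column is the first capture group:

```r
library(stringr)

# Hypothetical sample strings.
x <- c("Mild LV diastolic dysfunction", "Severe diastolic dysfunction")

capture <- regex(
  "\\b(?!lv)(\\w+)\\b (?:lv )?(?:d[iy]astolic|distolic) d[iy]sfunction",
  ignore_case = TRUE  # so that "lv" in the pattern also covers "LV"
)

# Column 2 of the match matrix holds capture group 1, i.e. the degree.
degree <- str_match(x, capture)[, 2]
print(degree)
```

Here the whole phrase is consumed by the match and only the group is read back, so no lookahead is needed.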

The fourth bird