I'm trying to extract the degree (Mild/Moderate/Severe) of a specific type of heart dysfunction (diastolic dysfunction) from a large number of echo reports.
Here is the link to a sample Excel file with 2 of those echo reports.
The lines are usually expressed like this: "Mild LV diastolic dysfunction" or "Mild diastolic dysfunction". Here, "Mild" is what I want to extract.
I wrote the following pattern:
pattern <- regex("(\\b\\w+\\b)(?= (lv )?(d(i|y)astolic|distolic) d(y|i)sfunction)",
                 ignore_case = FALSE)
Now, let's look at the results (remember I want the "Mild" part not the "LV" part):
str_view_all(df$echo, pattern)
As you can see, in strings like "Mild diastolic dysfunction" the pattern correctly selects "Mild", but in "Mild LV diastolic dysfunction" it selects "LV", even though I have put the optional lv inside the positive lookahead, (?= (lv )?...).
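In case the Excel file is not accessible, here is a minimal sketch that should reproduce the behaviour. The vector `x` is a hypothetical stand-in for df$echo, and its two strings are made up from the phrasings quoted above rather than taken from the real reports. I use str_extract_all() here instead of str_view_all() only so the matches print as plain text.

```r
library(stringr)

# Hypothetical stand-ins for two lines of df$echo (not the real reports)
x <- c("Mild diastolic dysfunction",
       "Mild LV diastolic dysfunction")

# Same pattern as above, unchanged
pattern <- regex("(\\b\\w+\\b)(?= (lv )?(d(i|y)astolic|distolic) d(y|i)sfunction)",
                 ignore_case = FALSE)

# Collect every match in each string; returns a list with one
# character vector per input string
res <- str_extract_all(x, pattern)
print(res)
```

The first element is the "Mild" I expect; the second element is where "LV" shows up instead.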
Does anyone know what I am doing wrong?