Regex pattern to find matches in a data frame is only partly working in R

Question

I have a regex pattern of pattern = "\\d+(?:[.,]\\d+)*\\s*mu\\b|\\b(?:kinetin|zeatin)\\b"

I have a function to find matches in a data frame (this function was given to me, so it might not be suited to my pattern):

find.all.matches <- function(search.col,pattern){
  captured <- str_match_all(search.col,pattern = pattern)
  t <- lapply(captured, str_trim)
  t2 <- lapply(t, function(x) gsub("[^a-z]","",x))
  t3 <- sapply(t2, unique)
  t4 <- lapply(t3, toString)
  found.col <- unlist(t4)
  return(found.col)
}

To find matches, I am using the chunk of code below and then I am adding the matches to a column in the data frame.

testing <- find.all.matches(search.col = all_data_1gen$abstract_l, 
                                    pat = pattern)
all_data_1gen$testing_matches <- testing

The function returns essentially only mu, kinetin and zeatin, but on regex101 (https://regex101.com/r/77oi0A/1) the same pattern also matches the amounts, which is what I need. Is the function or the pattern the problem, and how would I fix it?

Here's my string:

high-frequency somatic embryogenesis was achieved from an embryogenic cell suspension culture of acanthopanax koreanum nakai. stem segments were cultured on murashige and skoog (ms) medium containing auxins and cytokinins. opaque and friable embryogenic callus formed on ms medium with 4.5 mu m 2,4-dichlorophenoxyacetic acid (2,4-d) and 2.0 mu m kinetin or zeatin, but was highest on medium containing 4.5 mu m 2,4-d alone.

It's easier to help you if you include a simple [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input and desired output that can be used to test and verify possible solutions. — MrFlick, Aug 14 '23 at 15:58

score 1 · Accepted Answer · answered Aug 14 '23 at 16:11


In your function, the third line (t2 <- lapply(t, function(x) gsub("[^a-z]","",x))) removes every character outside the [a-z] range, so the digits, decimal separators and spaces are stripped from each match and only the letters (e.g. mu) remain. Commenting out that line in your function preserves the amounts.
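To see the effect of that line in isolation, here is a minimal sketch in base R. The vector `x` is a hypothetical stand-in for the trimmed matches, not data from the question.

```r
# Hypothetical sample of matches as they look before the gsub() step
x <- c("4.5 mu", "2.0 mu", "kinetin")

# "[^a-z]" matches any character that is NOT a lowercase letter,
# so digits, dots and spaces are all replaced with the empty string
cleaned <- gsub("[^a-z]", "", x)

print(cleaned)  # the amounts are gone: "mu", "mu", "kinetin"
```

After `unique()` is applied to this result, the two "mu" entries collapse into one, which would explain why only mu, kinetin and zeatin show up in the column.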

string = "high-frequency somatic embryogenesis was achieved from an embryogenic cell suspension culture of acanthopanax koreanum nakai. stem segments were cultured on murashige and skoog (ms) medium containing auxins and cytokinins. opaque and friable embryogenic callus formed on ms medium with 4.5 mu m 2,4-dichlorophenoxyacetic acid (2,4-d) and 2.0 mu m kinetin or zeatin, but was highest on medium containing 4.5 mu m 2,4-d alone."

pattern = "\\d+(?:[.,]\\d+)*\\s*mu\\b|\\b(?:kinetin|zeatin)\\b"

find.all.matches <- function(search.col,pattern){
  captured <- stringr::str_match_all(search.col,pattern = pattern)
  t <- lapply(captured, stringr::str_trim)
  #t2 <- lapply(t, function(x) gsub("[^a-z]","",x))
  t3 <- sapply(t, unique)
  t4 <- lapply(t3, toString)
  found.col <- unlist(t4)
  return(found.col)
}

find.all.matches(string, pattern)
#> [1] "4.5 mu"  "2.0 mu"  "kinetin" "zeatin"

Created on 2023-08-14 with reprex v2.0.2
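One caveat about the function above: for a single input string the output is four separate elements rather than one comma-separated string. As far as I can tell, that is because `sapply()` simplifies its result whenever every element has the same number of unique matches, so `toString()` is then applied to each match individually. The sketch below is one possible alternative, not the original author's method: it uses base R `regmatches()`/`gregexpr()` and `vapply()` to return exactly one string per input. The names `find_matches_per_row` and `abstracts` are hypothetical.

```r
pattern <- "\\d+(?:[.,]\\d+)*\\s*mu\\b|\\b(?:kinetin|zeatin)\\b"

# Hypothetical helper: one comma-separated string of unique matches per input
find_matches_per_row <- function(search_col, pattern) {
  # gregexpr() finds all matches; perl = TRUE enables (?:...) and \\b as in PCRE
  hits <- regmatches(search_col, gregexpr(pattern, search_col, perl = TRUE))
  # vapply() never simplifies to a matrix: always one character(1) per element
  vapply(hits, function(x) toString(unique(trimws(x))), character(1))
}

# Shortened, made-up abstracts for illustration only
abstracts <- c(
  "callus formed on ms medium with 4.5 mu m 2,4-d and 2.0 mu m kinetin or zeatin, but was highest with 4.5 mu m 2,4-d alone.",
  "no amounts here"
)

res <- find_matches_per_row(abstracts, pattern)
print(res)
```

Because the result always has the same length as the input, it can be assigned directly to a data frame column; inputs without any match yield an empty string.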

answered Aug 14 '23 at 16:11

Alberson Miranda


Just curious... is there something that the custom function `find.all.matches()` is doing that isn't accomplished by `str_extract_all()`? I'm missing it if there is, and `stringr` is already used in that custom function. I know the function originated from the OP. – ScottyJ Aug 14 '23 at 16:38
I might be wrong, but I was under the impression that str_extract_all only worked on a vector. Would it work on a specific column (abstract_l) in a data frame (all_data_1gen)? Each cell in the column has a string. If so, how would I do that? – Melissa Duda Aug 14 '23 at 16:48
@wackojacko1997 no idea, I don't use tidyverse. With base, I'd probably do something like `regmatches(string, gregexpr(pattern, string, perl = TRUE))` – Alberson Miranda Aug 14 '23 at 16:48
@AlbersonMiranda I had a feeling that this could be the issue because of the [^a-z] condition. I will try this! – Melissa Duda Aug 14 '23 at 16:50

@MelissaDuda if it's solved, please upvote and mark it as solved – Alberson Miranda Aug 15 '23 at 13:13

score 1 · Answer 2 · edited Aug 14 '23 at 17:09


Maybe not an answer, but directional help:

I don't think the problem is with the regex because I'm getting the same output you show in regex101:

library(stringr)

str <- "high-frequency somatic embryogenesis was achieved from an embryogenic cell suspension culture of acanthopanax koreanum nakai. stem segments were cultured on murashige and skoog (ms) medium containing auxins and cytokinins. opaque and friable embryogenic callus formed on ms medium with 4.5 mu m 2,4-dichlorophenoxyacetic acid (2,4-d) and 2.0 mu m kinetin or zeatin, but was highest on medium containing 4.5 mu m 2,4-d alone."

pat <- "\\d+(?:[.,]\\d+)*\\s*mu\\b|\\b(?:kinetin|zeatin)\\b"

str_view_all(str, pat)

> str_view_all(str, pat)  # May need to scroll right, but shows the matches it detects
[1] │ high-frequency somatic embryogenesis was achieved from an embryogenic cell suspension culture of acanthopanax koreanum nakai. stem segments were cultured on murashige and skoog (ms) medium containing auxins and cytokinins. opaque and friable embryogenic callus formed on ms medium with <4.5 mu> m 2,4-dichlorophenoxyacetic acid (2,4-d) and <2.0 mu> m <kinetin> or <zeatin>, but was highest on medium containing <4.5 mu> m 2,4-d alone.


str_extract_all(str, pat)

> str_extract_all(str, pat)
[[1]]
[1] "4.5 mu"  "2.0 mu"  "kinetin" "zeatin"  "4.5 mu"

edited Aug 14 '23 at 17:09

M--


answered Aug 14 '23 at 16:28

ScottyJ


Thank you so much for pointing me in the right direction. Does str_extract_all only work on a vector? I have a data frame, and I am trying to use this pattern on a specific column. – Melissa Duda Aug 14 '23 at 16:46
I use the `stringr` package frequently, but not specifically `str_extract_all()`. If I clearly understood what your df looked like to start, and what the end result is supposed to look like, I think using that (along with some call to `unique()`) would get you to the same place, but I'm not sure. Until I saw @AlbersonMiranda's solution, I didn't realize that the custom function was outputting almost exactly what `str_extract_all()` produces. Seems like using an existing function would be the way to go. – ScottyJ Aug 14 '23 at 16:59
@MelissaDuda it works on a dataframe. `df %>% rowwise() %>% mutate(list_of_amounts = str_extract_all(YourColumnName, YourPattern))` – M-- Aug 14 '23 at 17:08
@M-- Oh! Thank you so much for the line of code. Does this produce a list of lists? – Melissa Duda Aug 14 '23 at 17:13
@MelissaDuda it should add a column (list column) to your dataframe, but without actually seeing your data, there's a chance that I'd miss something (especially while typing on my phone). – M-- Aug 14 '23 at 17:15
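As a follow-up to the comments above: a data frame column is itself a vector, so `str_extract_all()` can be given the column directly. The sketch below assumes `stringr` is installed. The toy data frame merely stands in for `all_data_1gen`; only the column name `abstract_l` comes from the question, and the new column names `amounts_list` and `amounts` are made up for illustration.

```r
library(stringr)

pattern <- "\\d+(?:[.,]\\d+)*\\s*mu\\b|\\b(?:kinetin|zeatin)\\b"

# Toy stand-in for the real data (contents are hypothetical)
all_data_1gen <- data.frame(
  abstract_l = c(
    "ms medium with 4.5 mu m 2,4-d and 2.0 mu m kinetin or zeatin",
    "no growth regulators were added"
  ),
  stringsAsFactors = FALSE
)

# str_extract_all() returns a list with one character vector per row,
# which is stored here as a list column
all_data_1gen$amounts_list <- str_extract_all(all_data_1gen$abstract_l, pattern)

# Optional: collapse each row's unique matches into a single string
all_data_1gen$amounts <- vapply(
  all_data_1gen$amounts_list,
  function(x) toString(unique(x)),
  character(1)
)

print(all_data_1gen$amounts)
```

The list column keeps every match separately, which is convenient for later processing; the collapsed `amounts` column is closer to what `find.all.matches()` was meant to produce.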
