Hi, I have not seen a solution to a problem similar to the one I am having. I am trying to write a regex pattern that extracts the characters inside { } following the word major and places them in a major column. However, major is repeated in row 2, and I need to extract and combine the contents of both { } groups that follow it. Ideally I would do the same for the minor and incidental attributes. I am not sure what I am getting wrong here. Thanks!

test <- data.frame(lith=c("major{basalt} minor{andesite} incidental{dacite rhyolite}",
          "major {andesite flows} major {dacite flows}",
          "major{andesite} minor{dacite}",
          "major{basaltic andesitebasalt}"))

test %>%
  mutate(major = str_extract_all(test$lith, "[major].*[{](\\D[a-z]*)[}]") %>%
           map_chr(toString))

What I am looking for:

                         major    minor      incidental
1                       basalt andesite dacite rhyolite
2 andesite flows, dacite flows     <NA>            <NA>
3      basaltic andesitebasalt     <NA>            <NA>
r2evans

1 Answer


First, (almost) never use test$ within a dplyr pipe starting with test %>%. At best it's just a little inefficient; if any intermediate step re-orders, alters, or filters the data, the result will be either (a) an error (the preferred outcome) or (b) silently wrong. To see why, suppose you do

test %>%
  filter(grepl("[wy]", lith)) %>%
  mutate(major = str_extract_all(test$lith, ...))

In this case, the filter reduced the data from 4 rows to just 2 rows. However, since you're using test$lith, that's taken from the contents of test before the pipe started, so here test$lith is length-4 where we need it to be length-2.

Alternatively (and preferred),

test %>%
  filter(grepl("[wy]", lith)) %>%
  mutate(major = str_extract_all(lith, ...))

Here, the str_extract_all(lith, ...) sees only two values, not the original four.
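Both failure modes can be reproduced on a toy data frame. This is only a minimal sketch: the data frame df and the column copy are made up for illustration and are not part of the question's data.

```r
library(dplyr)

# hypothetical toy data, deliberately not in sorted order
df <- data.frame(lith = c("b", "a", "d", "c"))

# Re-ordering: df$lith is still in the original order, so `copy` no longer
# lines up with `lith`, and no warning or error is raised.
bad <- df %>% arrange(lith) %>% mutate(copy = df$lith)

# Referring to the column by name uses the data as it is at that step.
good <- df %>% arrange(lith) %>% mutate(copy = lith)

# Filtering: df$lith still has 4 values but only 2 rows remain, so mutate()
# refuses to recycle it and signals an error.
err <- tryCatch(
  df %>% filter(lith %in% c("a", "b")) %>% mutate(copy = df$lith),
  error = function(e) e
)
```

Here bad is the silently-wrong case, good is the intended result, and err holds the error condition from the filtered pipe.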


On to the regularly-scheduled answer ...

I'll add a row-number column rn as a reference to the original row (an id for the source). This is both functional (it is needed for the grouping and pivoting to work) and useful in case you need to tie the result back to the original data. I'm inferring that you want the values grouped together as strings instead of list-columns, though it's easy enough to switch to the latter if desired.

library(dplyr)
library(stringr) # str_extract_all
library(tidyr) # unnest, pivot_wider
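test %>%
  mutate(
    rn = row_number(),
    # every "category {values}" chunk found in the string, as a list per row
    tmp = str_extract_all(lith, "\\b([[:alpha:]]+) ?\\{([^}]+)\\}"),
    # split each chunk into its category and the text inside the braces
    tmp = lapply(tmp, function(z) strcapture("^([^{}]*) ?\\{(.*)\\}", z, list(categ="", val="")))
  ) %>%
  unnest(tmp) %>%
  mutate(across(c(categ, val), trimws)) %>%
  # combine repeated categories within a row into one comma-separated string
  group_by(rn, categ) %>%
  summarize(val = paste(val, collapse = ", ")) %>%
  # id_cols is named explicitly; newer tidyr versions deprecate passing it by position
  pivot_wider(id_cols = rn, names_from = "categ", values_from = "val") %>%
  ungroup()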
# # A tibble: 4 x 4
#      rn incidental      major                        minor   
#   <int> <chr>           <chr>                        <chr>   
# 1     1 dacite rhyolite basalt                       andesite
# 2     2 NA              andesite flows, dacite flows NA      
# 3     3 NA              andesite                     dacite  
# 4     4 NA              basaltic andesitebasalt      NA      
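If list-columns are preferred over comma-joined strings, one possible variant is to collect the values with list() instead of paste(). This is a sketch under a few assumptions: it uses only the first two rows of the question's data, the name res is made up for illustration, and it passes a data.frame prototype to strcapture rather than the list used above.

```r
library(dplyr)
library(stringr) # str_extract_all
library(tidyr)   # unnest, pivot_wider

# first two rows of the question's data, enough to show a repeated category
test <- data.frame(lith = c(
  "major{basalt} minor{andesite} incidental{dacite rhyolite}",
  "major {andesite flows} major {dacite flows}"
))

res <- test %>%
  mutate(
    rn = row_number(),
    tmp = str_extract_all(lith, "\\b([[:alpha:]]+) ?\\{([^}]+)\\}"),
    tmp = lapply(tmp, function(z) {
      strcapture("^([^{}]*) ?\\{(.*)\\}", z,
                 data.frame(categ = character(), val = character()))
    })
  ) %>%
  unnest(tmp) %>%
  mutate(across(c(categ, val), trimws)) %>%
  group_by(rn, categ) %>%
  # keep the values as a character vector per cell instead of one string
  summarize(val = list(val), .groups = "drop") %>%
  pivot_wider(id_cols = rn, names_from = categ, values_from = val)

res$major[[2]] # character vector with both "major" entries of row 2
```

Cells with no match come back as NULL rather than NA in this form, so downstream code needs to handle empty entries accordingly.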
r2evans
  • Thanks @r2evans for the insight. This works great even if I add more complicated strings such as "major {andesite flows} major {basaltic andesite flows} major {basalt flows} minor {volcanic breccia} minor {pyroclastic rocks}". The regex has me a bit confused, though. Is \\b matching the word categories (e.g. major, minor, incidental) via ([[:alpha:]]+) and then extracting anything within { } via ?\\{([^}]+)\\} – Wasatch801 Jul 15 '22 at 16:27
  • The `\\b` is a word boundary, perhaps not strictly necessary, but it's served me well in the past. From there, I think you are interpreting the pattern correctly. When learning or evaluating regex, I often find it useful to reference https://stackoverflow.com/a/22944075/3358272, https://regexr.com/, and https://regex101.com/. I hope that helps. – r2evans Jul 15 '22 at 16:29
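To see what each of the two patterns does, they can be tried on a single string with base R alone. This is only a sketch: the names x, pieces, and parts are made up for illustration, gregexpr with perl = TRUE stands in for str_extract_all, and a data.frame prototype is used for strcapture.

```r
x <- "major {andesite flows} major {dacite flows}"

# Step 1: find every "word {stuff}" chunk. \\b anchors at a word boundary,
# [[:alpha:]]+ is the category word, " ?" allows an optional space, and
# \\{([^}]+)\\} is a literal pair of braces around anything except "}".
pat <- "\\b([[:alpha:]]+) ?\\{([^}]+)\\}"
pieces <- regmatches(x, gregexpr(pat, x, perl = TRUE))[[1]]

# Step 2: split each chunk into the part before the brace and the part inside.
parts <- strcapture("^([^{}]*) ?\\{(.*)\\}", pieces,
                    data.frame(categ = character(), val = character()))

# [^{}]* is greedy and also swallows the optional space, hence the trimws()
parts$categ <- trimws(parts$categ)
parts
```

The first step yields one string per category chunk, and the second turns those chunks into a two-column data frame of category and value, which is what the unnest step in the answer then expands into rows.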