Create new column from an existing column with pattern matching in R

Question

I'm trying to add a new column based on another using pattern matching. I've read this post, but I'm not getting the desired output.

I want to create a new column (SubOrder) based on the GreatGroup column. I have tried the following:

SubOrder <- rep(NA_character_, length(myData))

SubOrder[grepl("udults", myData, ignore.case = TRUE)] <-  "Udults"
SubOrder[grepl("aquults", myData, ignore.case = TRUE)] <-  "Aquults"
SubOrder[grepl("aqualfs", myData, ignore.case = TRUE)] <-  "aqualfs"
SubOrder[grepl("humods", myData, ignore.case = TRUE)] <-  "humods"
SubOrder[grepl("udalfs", myData, ignore.case = TRUE)] <-  "udalfs"
SubOrder[grepl("orthods", myData, ignore.case = TRUE)] <-  "orthods"
SubOrder[grepl("udalfs", myData, ignore.case = TRUE)] <-  "udalfs"
SubOrder[grepl("psamments", myData, ignore.case = TRUE)] <-  "psamments"
SubOrder[grepl("udepts", myData, ignore.case = TRUE)] <-  "udepts"
SubOrder[grepl("fluvents", myData, ignore.case = TRUE)] <-  "fluvents"
SubOrder[grepl("aquods", myData, ignore.case = TRUE)] <-  "aquods"

For example, I'm looking for "udults" inside any word, such as Hapludults or Paleudults, and want to return just "udults".

EDIT: If anyone wants to take a shot at alistaire's comment, these are the search patterns I would use.

 subOrderNames <- c("Udults", "Aquults", "Aqualfs", "Humods", "Udalfs", "Orthods", "Psamments", "Udepts", "fluvents")

Example data below.

myData <- dput(head(test))
structure(list(1:6, SID = c(200502L, 200502L, 200502L, 200502L, 
200502L, 200502L), Groupdepth = c(11L, 12L, 13L, 14L, 21L, 22L
), AWC0to10 = c(0.12, 0.12, 0.12, 0.12, 0.12, 0.12), AWC10to20 = c(0.12, 
0.12, 0.12, 0.12, 0.12, 0.12), AWC20to50 = c(0.12, 0.12, 0.12, 
0.12, 0.12, 0.12), AWC50to100 = c(0.15, 0.15, 0.15, 0.15, 0.15, 
0.15), Db3rdbar0to10 = c(1.43, 1.43, 1.43, 1.43, 1.43, 1.43), 
    Db3rdbar10to20 = c(1.43, 1.43, 1.43, 1.43, 1.43, 1.43), Db3rdbar20to50 = c(1.43, 
    1.43, 1.43, 1.43, 1.43, 1.43), Db3rdbar50to100 = c(1.43, 
    1.43, 1.43, 1.43, 1.43, 1.43), HydrcRatngPP = c(0L, 0L, 0L, 
    0L, 0L, 0L), OrgMatter0to10 = c(1.25, 1.25, 1.25, 1.25, 1.25, 
    1.25), OrgMatter10to20 = c(1.25, 1.25, 1.25, 1.25, 1.25, 
    1.25), OrgMatter20to50 = c(1.02, 1.02, 1.02, 1.02, 1.02, 
    1.02), OrgMatter50to100 = c(0.12, 0.12, 0.12, 0.12, 0.12, 
    0.12), Clay0to10 = c(8, 8, 8, 8, 8, 8), Clay10to20 = c(8, 
    8, 8, 8, 8, 8), Clay20to50 = c(9.4, 9.4, 9.4, 9.4, 9.4, 9.4
    ), Clay50to100 = c(40, 40, 40, 40, 40, 40), Sand0to10 = c(85, 
    85, 85, 85, 85, 85), Sand10to20 = c(85, 85, 85, 85, 85, 85
    ), Sand20to50 = c(83, 83, 83, 83, 83, 83), Sand50to100 = c(45.8, 
    45.8, 45.8, 45.8, 45.8, 45.8), pHwater0to20 = c(6.3, 6.3, 
    6.3, 6.3, 6.3, 6.3), Ksat0to10 = c(23, 23, 23, 23, 23, 23
    ), Ksat10to20 = c(23, 23, 23, 23, 23, 23), Ksat20to50 = c(19.7333, 
    19.7333, 19.7333, 19.7333, 19.7333, 19.7333), Ksat50to100 = c(9, 
    9, 9, 9, 9, 9), TaxClName = c("Fine, mixed, semiactive, mesic Oxyaquic Hapludults", 
    "Fine, mixed, semiactive, mesic Oxyaquic Hapludults", "Fine, mixed, semiactive, mesic Oxyaquic Hapludults", 
    "Fine, mixed, semiactive, mesic Oxyaquic Hapludults", "Fine, mixed, semiactive, mesic Oxyaquic Hapludults", 
    "Fine, mixed, semiactive, mesic Oxyaquic Hapludults"), GreatGroup = c("Hapludults", 
    "Hapludults", "Hapludults", "Hapludults", "Hapludults", "Hapludults"
    )), .Names = c("", "SID", "Groupdepth", "AWC0to10", "AWC10to20", 
"AWC20to50", "AWC50to100", "Db3rdbar0to10", "Db3rdbar10to20", 
"Db3rdbar20to50", "Db3rdbar50to100", "HydrcRatngPP", "OrgMatter0to10", 
"OrgMatter10to20", "OrgMatter20to50", "OrgMatter50to100", "Clay0to10", 
"Clay10to20", "Clay20to50", "Clay50to100", "Sand0to10", "Sand10to20", 
"Sand20to50", "Sand50to100", "pHwater0to20", "Ksat0to10", "Ksat10to20", 
"Ksat20to50", "Ksat50to100", "TaxClName", "GreatGroup"), class = c("tbl_df", 
"data.frame"), row.names = c(NA, -6L))

To make your code a lot DRYer, make a vector of your patterns (and replacements, if they're different), and use `sapply` to call `grepl` or `gsub` or whatever you like. — alistaire, Feb 05 '16 at 21:32
I attempted something similar with : subOrderNames <- c("Udults", "Aquults", "Aqualfs", "Humods", "Udalfs", "Orthods", "Psamments", "Udepts", "fluvents") but got lost in the machine 0o0o0 — derelict, Feb 05 '16 at 21:51
With a `for` loop, `pat <- c('udults', 'aquults', 'aqualfs', 'humods', 'udalfs', 'orthods', 'psamments', 'udepts', 'fluvents', 'aquods'); for(x in 1:length(pat)){SubOrder[grepl(pat[x], myData$GreatGroup, ignore.case = TRUE)] <- pat[x]}` Make a second vector for replacements and substitute it for the second `pat[x]` if you need. — alistaire, Feb 05 '16 at 21:54
Or more directly, `myData$SubOrder <- myData$GreatGroup; for(x in pat){myData$SubOrder <- gsub(paste0('.*', x, '.*'), x, myData$SubOrder, ignore.case = TRUE)}`. If there's no match in this case, it's left with the value of `GreatGroup` instead of `NA`, though. — alistaire, Feb 05 '16 at 22:13

alistaire · Accepted Answer · 2016-02-05T23:33:27.293

A few options, some of which I posted in the comments above.

Note: All options assume the replacement for a string that matches a pattern is just the pattern itself. If you want something else, they're all easily editable to include separate replacement values.

Option 1: `for` + `grepl`

Using the same code as the original, but looping to avoid repetitive code:

# make a vector of patterns
pat <- c('udults', 'aquults', 'aqualfs', 'humods', 'udalfs', 'orthods', 'psamments', 'udepts', 'fluvents', 'aquods')

# one element per row of myData (length(myData) would give the number of columns)
SubOrder <- rep(NA_character_, nrow(myData))

for (x in seq_along(pat)) {
  SubOrder[grepl(pat[x], myData$GreatGroup, ignore.case = TRUE)] <- pat[x]
}

Option 2: `for` + `gsub`

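Build the new column in place by copying myData$GreatGroup and then altering it with gsub. The `.*` pasted on either side of each pattern makes the regex match the whole string, so the entire value is replaced by the pattern rather than just the part that matched.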

myData$SubOrder <- myData$GreatGroup
for(x in pat){
  myData$SubOrder <- gsub(paste0('.*', x, '.*'), x, myData$SubOrder, ignore.case = TRUE)
}

Note that values not matched by one of the strings in pat will have the value from GreatGroup, not NA. If you want them to be NA, fix them with

myData$SubOrder[!(myData$SubOrder %in% pat)] <- NA

Option 3: named list + `stringr::str_replace_all`

My favorite because it doesn't loop, although it requires the stringr package (which is pretty awesome, anyway).

Make a named list from pat, where the name is the regex you want to replace, and the item is the replacement string:

l <- as.list(pat)
names(l) <- paste0('.*', pat, '.*')

so it looks like

> l
$`.*udults.*`
[1] "udults"

$`.*aquults.*`
[1] "aquults"

$`.*aqualfs.*`
[1] "aqualfs"
......

Then use str_replace_all to DO IT ALL AT ONCE:

library(stringr)
myData$SubOrder <- str_replace_all(myData$GreatGroup, unlist(l))  # unlist(): newer stringr wants a named character vector, not a list

Boom.

Note 1: str_replace_all doesn't have an ignore.case option, but you can wrap myData$GreatGroup in tolower (easy) or reconfigure the regex (hard).

Note 2: Like Option 2, it leaves unmatched entries as the value from GreatGroup, so use the line at the end of that option to go back to NAs, if you like.
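
For comparison, here is a minimal base R sketch of a loop-free variant that is not one of the options above: it joins the patterns into a single alternation regex and pulls out the matched part with `regexpr` and `regmatches`. The `great_group` vector is hypothetical sample data standing in for `myData$GreatGroup`, and unmatched values come out as `NA` directly.

```r
# Hypothetical sample values standing in for myData$GreatGroup
great_group <- c("Hapludults", "Paleudults", "Endoaquults",
                 "Quartzipsamments", "Unknown")

pat <- c("udults", "aquults", "aqualfs", "humods", "udalfs", "orthods",
         "psamments", "udepts", "fluvents", "aquods")

# One regex that matches any of the patterns
alternation <- paste(pat, collapse = "|")

# regexpr() gives the position of the first match, or -1 when there is none
m <- regexpr(alternation, great_group, ignore.case = TRUE)

# Start from NA and fill in only the elements that matched
sub_order <- rep(NA_character_, length(great_group))
sub_order[m != -1] <- tolower(regmatches(great_group, m))

sub_order
# [1] "udults"    "udults"    "aquults"   "psamments" NA
```

With the real data, the last step would be assigning the result to `myData$SubOrder`.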

+1 for Option 3... though now it doesn't work with named lists, just named vectors `names(pat) <- paste0('.*', pat, '.*')` and then `myData$SubOrder <- str_replace_all(myData$GreatGroup, pat)` does the trick. — seapen, Mar 27 '20 at 15:31

Mist · Answer 2 · 2016-02-05T21:40:45.593

Try this:

myData$SubOrder[grepl("udults", myData$TaxClName, ignore.case = TRUE) | grepl("udults", myData$GreatGroup, ignore.case = TRUE)] <-  "Udults"

You can add as many columns to the filter as you want.
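
If the list of columns grows, chaining `grepl(...) | grepl(...)` by hand gets long. A rough sketch of the same idea, assuming a hypothetical two-row data frame `df` and a hypothetical `cols` vector naming the columns to search:

```r
# Hypothetical two-row data frame with the two columns used above
df <- data.frame(
  TaxClName  = c("Fine, mixed, semiactive, mesic Oxyaquic Hapludults",
                 "Sandy, siliceous, thermic Typic Quartzipsamments"),
  GreatGroup = c("Hapludults", "Quartzipsamments"),
  stringsAsFactors = FALSE
)

cols <- c("TaxClName", "GreatGroup")   # columns to search

# One logical vector per column, then OR them together row by row
hits <- Reduce(`|`, lapply(df[cols], grepl, pattern = "udults", ignore.case = TRUE))

df$SubOrder <- NA_character_
df$SubOrder[hits] <- "Udults"

df$SubOrder
# [1] "Udults" NA
```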

edited Feb 05 '16 at 21:40

answered Feb 05 '16 at 21:29


score 1 · Answer 3 · answered Feb 05 '16 at 21:29

I'm using dplyr, but you probably need to create a giant nested ifelse statement...

library(dplyr)

myData %>%
  mutate(SubOrder = ifelse(grepl('udults', GreatGroup, ignore.case = TRUE), 'Udults',
                    ifelse(grepl('aquults', GreatGroup, ignore.case = TRUE), 'Aquults',
                           ###  All of the other ifelse statements
                    ifelse(grepl('fluvents', GreatGroup, ignore.case = TRUE), 'fluvents', 'aquods'))))

eipi10 · Answer 4 · 2016-02-05T23:36:22.350

You could do this with a function that successively substitutes each pattern, which avoids repeating your code over and over. Note that with this approach, if a given string matches more than one pattern, the first pattern in the substitution sequence will be the one that gets used.

# multi-grepl function adapted from http://stackoverflow.com/a/15254254/496488
mgrepl <- function(pattern, replacement, x, ...) {
  if (length(pattern) != length(replacement)) {
    stop("pattern and replacement do not have the same length.")
  }
  result <- x
  for (i in seq_along(pattern)) {
    result[grepl(pattern[i], result, ...)] = replacement[i]
  }
  result
}

# Patterns and replacements
pat = c("udults","aquults","humods","fluvents")
repl = c("Udults","Aquults","humods","fluvents")

SubOrder =  mgrepl(pat, repl, myData$GreatGroup)

SubOrder

[1] "Udults" "Udults" "Udults" "Udults" "Udults" "Udults"

# Or, if you want to add this as a new column to the data:
myData$SubOrder = mgrepl(pat, repl, myData$GreatGroup)

One additional note: One issue with the code in your question is that you referenced the whole data frame, rather than the column you want to substitute:

SubOrder[grepl("udults", myData, ignore.case = TRUE)] <-  "Udults"

should be changed to

SubOrder[grepl("udults", myData$GreatGroup, ignore.case = TRUE)] <-  "Udults"
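
To see why this matters, here is a small sketch on a hypothetical data frame `df` (not the data from the question). When `grepl` is handed a whole data frame it coerces it with `as.character`, so the result has one element per column rather than one per row. The same applies to `length(myData)`, which counts columns.

```r
# Hypothetical three-row, two-column data frame
df <- data.frame(
  SID        = c(1L, 2L, 3L),
  GreatGroup = c("Hapludults", "Paleudults", "Endoaquults"),
  stringsAsFactors = FALSE
)

# Whole data frame: one value per column
whole <- grepl("udults", df, ignore.case = TRUE)
length(whole)   # 2, the number of columns

# Single column: one value per row, which is what the row index needs
by_row <- grepl("udults", df$GreatGroup, ignore.case = TRUE)
by_row          # TRUE TRUE FALSE
```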

UPDATE: Regarding your comment, see the code below. The function does replace both values with "Udults".

myData$GreatGroup[1] = "Paleudults"

myData$GreatGroup

[1] "Paleudults" "Hapludults" "Hapludults" "Hapludults" "Hapludults" "Hapludults"

mgrepl(pat, repl, myData$GreatGroup)

[1] "Udults" "Udults" "Udults" "Udults" "Udults" "Udults"
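
One caveat on the "first pattern wins" behaviour: `mgrepl` matches later patterns against the already-replaced values, so a replacement that happens to contain a later pattern can be overwritten. A possible variant, sketched here under the hypothetical name `mgrepl_first`, always matches against the original input and skips elements that have already been replaced:

```r
# Hypothetical variant of mgrepl(): match against the original x only
mgrepl_first <- function(pattern, replacement, x, ...) {
  stopifnot(length(pattern) == length(replacement))
  result <- x
  done <- rep(FALSE, length(x))   # TRUE once an element has been replaced
  for (i in seq_along(pattern)) {
    hit <- grepl(pattern[i], x, ...) & !done
    result[hit] <- replacement[i]
    done <- done | hit
  }
  result
}

# Contrived input: the first replacement contains the second pattern
out <- mgrepl_first(c("udults", "soil"),
                    c("udults soil", "other"),
                    c("Hapludults", "topsoil"))
out
# [1] "udults soil" "other"
```

With the patterns and replacements used in this answer the two functions give the same result; the difference only shows up in cases like the contrived one above.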

Thanks for pointing out the error in my code. Also, it appears this code returns the pattern as the replacement. I would need to be able to search for 'udults' in multiple words such as Hapludults, Paleudults, and just return udults for both. — derelict, Feb 05 '16 at 22:35
See the update to my answer. The code in my answer seems to be doing the correct thing as far as I can tell. — eipi10, Feb 05 '16 at 23:36
