1

I have the task of checking to see if values in a number of different columns appear in a character string in a field that contains an item name. If the values appear in this item name column, they need to be extracted and placed in a new column. I need to search one column at a time, so it will look like this: In column A, search for unique values from column B. I'll need to do this for a number of times where column A will always be the same, but the set of unique values from column B will be different because I'm using unique values from column B. Here's some example data:

Col_A <- c("blue shovel 1024", "red shovel 1022", "green bucket 3021",
           "green rake 3021", "yellow shovel 1023")
Col_B <- c("blue", "red", "green", "blue", "yellow")

df <- data.frame(Col_A, Col_B)
print(df)
               Col_A  Col_B  Col_C (output column)
1   blue shovel 1024   blue  blue
2    red shovel 1022    red  red
3  green bucket 3021  green  green
4    green rake 3021   blue  green
5 yellow shovel 1023 yellow  yellow

In the above case, I want to search for the unique values from Col_B and Col_A and then if any are found, place them in a new column (Col_C). If it doesnt find a value, or the value isnt what is expected (i.e., row 4) that's ok. I'm just trying to figure out how to make this happen.

I've tried using mutate and str_extract like follows:

mutate(New_Col = str_extract(Col_A, unique_Col_B_vals))

But I'm not really having any luck. Sometimes it will return a value I would expect, and other times it returns a value that doesn't make sense. For reference, "unique_Col_B_vals" above is a data frame. Wondering if maybe that is part of the problem?

I'm not dead set on this approach, so if there is a far better way to search over a set of unique values from one column in another column, I am all ears. Thanks!

*Edit

The dataset I'm working with has a lot of issues with consistency. Values in Col_A are much longer in the dataset and are supposed to be made up of different values from multiple fields (basically like a concatenate), but we know this is not happening correctly in many cases. So I'm taking unique values from various fields (e.g. Col_B) and seeing if one of those unique values pops up in Col_A. If it does, I want to extract that and bring it to a new column (Col_C) so that I can compare what is in Col_B vs what was extracted from Col_A.

Also for clarity's sake, what I want to happen is that for each value in Col_A, search through all the unique values in Col_B and extract whatever is found to Col_C.

I've tried the following as well, but get an error:

uniquevals <- list(unique(df$Col_B))

df <- df %>% 
  mutate(Col_C = str_extract(Col_A, uniquevals))

Error: Problem with `mutate()` column `Col_C`.
i `Col_C = str_extract(Col_A, uniquevals)`.
x no applicable method for 'type' applied to an object of class "list"
benh56
  • 37
  • 5
  • 1
    What is the output column here? – Tim Biegeleisen Dec 01 '21 at 13:00
  • 2
    Possible duplicate https://stackoverflow.com/q/6379032/680068 and https://stackoverflow.com/q/45090804/680068 and https://stackoverflow.com/q/34951410/680068 – zx8754 Dec 01 '21 at 13:03
  • 2
    `mutate(df, newcol = str_extract(Col_A, Col_B))`? I don't understand what you expect with `unique_Col_B_vals`: you say that row 4 should be different, I'm inferring that you mean it should be `NA` since `blue` is not in its `Col_A`. If instead you want to extract `green` because it is found in a different row, then perhaps `mutate(df, newcol = str_extract(Col_A, paste0("\\b(", paste0(unique(Col_B), collapse = "|"), ")\\b")))`. – r2evans Dec 01 '21 at 13:22
  • Apologies for not being more clear... a little backstory will probably help. Essentially I've been given a dataset that we know has a lot of issues with consistency. Col_A actually has a lot of information in it compared to my example above, and we've been told it is a compilation of a bunch of different attribute fields. I want to take a field (like Col_B) and gather the unique values (thought being I would expect to see one of them in Col_A) and then see if I'm able to find them in Col_A to verify they are there, which is how the dataset *should* work). First post on Stack Overflow :) – benh56 Dec 02 '21 at 13:56
  • @TimBiegeleisen we could call it Col_C. THanks for pointing out I hadn't made that clear. – benh56 Dec 02 '21 at 14:03
  • Are you looking for `str_extract(df$Col_A, df$Col_B)` ? – Nad Pat Dec 02 '21 at 14:12
  • @NadPat Sort of, but instead of just whatever value happens to be in df$Col_B, I want to search through all of the unique values in Col_B, i.e. blue, red, green, and blue. Just from looking at the data, I know there are cases where Col_A might have green within it, but the corresponding value from Col_B will be yellow, so I would want to pull out green and have it look like this: Col_A Col_B Col_C green bucket 3021 yellow green – benh56 Dec 02 '21 at 14:42
  • @r2evans I think your solution at the end did the trick. Thank you! I am fairly new to R and don't have much experience with regex yet. Would you be able to kind of break down what your code does? – benh56 Dec 03 '21 at 19:49
  • @benh56 done, see my answer, hope it helps – r2evans Dec 03 '21 at 20:02

1 Answers1

1

I think this works for you:

mutate(df, Col_C = stringr::str_extract(
  Col_A,
  paste0("\\b(", paste0(unique(Col_B), collapse = "|"), ")\\b")))
#                Col_A  Col_B  Col_C
# 1   blue shovel 1024   blue   blue
# 2    red shovel 1022    red    red
# 3  green bucket 3021  green  green
# 4    green rake 3021   blue  green
# 5 yellow shovel 1023 yellow yellow

Breakdown:

  • paste0(unique(Col_B), collapse="|") takes the words in Col_B, de-duplicates it, and concatenates them all together with | symbols; that is, c("blue","red","green") --> "blue|red|green". In regex, the | symbol is an "OR" operator.
  • \\b( and )\\b are word-boundaries, meaning that there isn't a word-like character immediately before (first) or after (second) the patterns; by adding this around the words, we prevent a partial match of blu on blue (in case that ever happens); while it is not apparent that this changes anything here, it's a more defensive/specific pattern. The parens add grouping, more evident in the next bullet.
  • With all of that, our overall pattern looks something like "\\b(blue|red|green)\\b" (abbreviated). This translates into "find blue or red or green such that there is a word-boundary on both ends of whichever one(s) you find".
r2evans
  • 141,215
  • 6
  • 77
  • 149
  • Thank you so much! This worked great. Really appreciate the breakdown of the code too! – benh56 Dec 05 '21 at 02:08
  • Ah sorry, thanks for pointing me in the right direction (again). :) – benh56 Dec 06 '21 at 14:40
  • Apologies for continuing to post after marking this solved... running into an issue for two of the 10 or so fields I'm analyzing. Getting an "x Syntax error in regex pattern" error whne I run it on field where some of the values begin with an asterisk (*) e.g. "* red bucket". The solution you provided has worked on fields that began with a - or #, but I think the * is giving it some issues. Would you have any idea on how to tweak the code? Based on your explanation above I was thinking maybe telling R to use both \\b and \\B boundaries, but I wasn't successful in making that work. – benh56 Dec 06 '21 at 15:40