1

I have a dataframe 'df1' with a string column 'Field_notes' of various information that looks like this:

Alt_ID Field_notes
JMs # 04J0342
JMs # 04J0343
JMs # 04J0344
# broken leg
54.2
JMs # 04J0345

I would like to extract parts of the strings from the "Field_notes" column for specific rows only to the "Alt_ID" column. In this case, I'd like to subset rows 1,2,3,6 so that the alphanumeric combination after "JMs # " is moved to the "Alt_ID" column, so the result looks like:

Alt_ID Field_notes
04J0342 JMs # 04J0342
04J0343 JMs # 04J0343
04J0344 JMs # 04J0344
# broken leg
54.2
04J0345 JMs # 04J0345

The tricky part is that there are so many combinations of information in Field_notes that I probably can't rely on character patterns and instead have to rely on specifying row names/numbers. In this case, I don't want to extract anything from '# broken leg'.

Jacob
  • 329
  • 2
  • 10
  • Try `Alt_ID <- sub('.*#\\s+', '', df1$Field_notes)`. – Rui Barradas Nov 17 '22 at 17:02
  • It worked for the rows of interest, but also grabbed information from all other rows and placed it in Alt_ID, so not the answer I was hoping for. However, maybe I can refine your answer. Can you explain what this section means: '.*#\\s+'? – Jacob Nov 17 '22 at 17:05
  • If you can't rely on patterns how would you like to identify which rows you want to pick out to get the substrings? And once you get those rows, without a standard pattern how would you know what substrings to extract? – henryn Nov 17 '22 at 17:26

3 Answers3

2

Update:

Here is another approach: Now you don't rely on # but now you rely on count of characters:

df %>% 
  mutate(Alt_ID = ifelse(nchar(Field_notes)==13,  str_extract(Field_notes,  '\\b\\w+$'), ""), .before=1)      
  Alt_ID   Field_notes
1 04J0342 JMs # 04J0342
2 04J0343 JMs # 04J0343
3 04J0344 JMs # 04J0344
4            broken leg
5                  54.2
6 04J0345 JMs # 04J0345

First answer:

A combination of regex '\\b\\w+$' and stringr functions:

library(dplyr)
library(stringr)
df %>% 
  mutate(Alt_ID= str_extract(Field_notes,  '\\b\\w+$')) %>% 
  mutate(Alt_ID = ifelse(str_detect(Field_notes, '\\#'), Alt_ID, ""))
    Field_notes  Alt_ID
1 JMs # 04J0342 04J0342
2 JMs # 04J0343 04J0343
3 JMs # 04J0344 04J0344
4    broken leg        
5          54.2        
6 JMs # 04J0345 04J0345
TarJae
  • 72,363
  • 6
  • 19
  • 66
  • I edited the comment to add an # in one of the rows where I don't want the information moved to Alt_ID. The idea is I probably can't rely on the # as an indicator, but instead probably have to rely on row names/numbers. – Jacob Nov 17 '22 at 17:10
  • 2
    @Jacob could you add a row to your question where `#` is contained but you do not need the information? Or rather, what is the pattern you are looking for? – Onyambu Nov 17 '22 at 17:18
  • Can you please explain `'\\b\\w+$'`? – Ed_Gravy Nov 17 '22 at 17:19
  • @onyambu I'm trying to avoid looking for patterns to do this because I have tens of thousands of rows in the actual dataframe with all sorts of information. I'd rather be able to subset information by specifying row numbers and then identifying within those rows the pattern needed to extract the information I want. Does that make sense? – Jacob Nov 17 '22 at 17:20
  • Exelently explained here by @Ωmega – TarJae Nov 17 '22 at 17:20
  • 1
    @Jacob then since there is no pattern, there is no way to extract the information. I see the pattern `JMs #` but since that is not the pattern, you cannot be able to extract the information. – Onyambu Nov 17 '22 at 17:22
  • @onyambu Surely there must be a way to just extract partial strings from specific rows. No? – Jacob Nov 17 '22 at 17:22
  • @Jacob Sure but the partial string should have a pattern or any other attribute so that we can identify! :-) – TarJae Nov 17 '22 at 17:23
  • @Jacob Not unless you know the pattern or what you want from the string. ie do you know the position? do you know the pattern? or do you know the string?eg If I tell you i want a `2` from `abcd2` and then I want a `c` from `dcef` and I ask you to get the `a` from `dfga`, there is no automation to get the `a` since no enough information – Onyambu Nov 17 '22 at 17:25
  • @TarJae I agree. In this case, I first want to specify rows 1,2,3, and 6 and then subset the string after the pattern "JMs # ". The problem is, the pattern will change for each set of rows I specify. The part I'm struggling with is how to specify rows first. – Jacob Nov 17 '22 at 17:26
  • Best would be to give a light full example. I am sure that specifying rows will not be the problem. – TarJae Nov 17 '22 at 17:28
  • @TarJae Is it possible you could show how to specify rows and combine that with the pattern "JMs # " so I can see how it is done? – Jacob Nov 17 '22 at 17:31
  • 1
    If you know which rows you want then it is no problem, but I think that you don't know the row numbers. You can do with `arrange(desc(Field_notes))` a grouping of patterns visually. I have edited the answer with a new approach! Good luck :-) – TarJae Nov 17 '22 at 17:37
2

Here's another solution:

library(dplyr)
library(stringr)

df <- data.frame(
  Alt_ID = NA_character_,
  Field_notes = c("JMs # 04J0342", "JMs # 04J0343", "JMs # 04J0344",
                   "# broken leg", "54.2", "JMs # 04J0345")
)

id_pattern <- "(?<=JMs # )\\w+"

df %>%
  mutate(
    Alt_ID = str_extract(Field_notes, id_pattern)
  )
#    Alt_ID   Field_notes
# 1 04J0342 JMs # 04J0342
# 2 04J0343 JMs # 04J0343
# 3 04J0344 JMs # 04J0344
# 4    <NA>  # broken leg
# 5    <NA>          54.2
# 6 04J0345 JMs # 04J0345

# Or equivalently:
df$Alt_ID <- str_extract(df$Field_notes, id_pattern)
df
#    Alt_ID   Field_notes
# 1 04J0342 JMs # 04J0342
# 2 04J0343 JMs # 04J0343
# 3 04J0344 JMs # 04J0344
# 4    <NA>  # broken leg
# 5    <NA>          54.2
# 6 04J0345 JMs # 04J0345

The regex "(?<=JMs # )\\w+" is interpreted as find the substring "JMs # and match the alphanumeric substring that follows.

Santiago
  • 641
  • 3
  • 14
1

You could try this, using the regex JMs\\s#\\s(\\w+)$ and picking out the capturing group:

library(stringr)
library(dplyr)

mutate(df, Alt_ID = stringr::str_match(field_notes, "JMs\\s#\\s(\\w+)$")[, 2])

Output:

   Alt_ID   field_notes
1 04J0342 JMs # 04J0342
2    <NA>  # broken leg
3 04J0334 JMs # 04J0334
henryn
  • 1,163
  • 4
  • 15
  • This is the best answer so far in that it only works on the rows I listed in the example, so kudos and thank you. Instead of specifying rows to subset, I could just update the pattern listed here which is a good option. However, I'd still like to know how to specify rows. – Jacob Nov 17 '22 at 17:36
  • I guess my remaining question is, if I keep updating the pattern recognition section of code to subset other rows, will the NAs overwrite the updates of previous pattern matches? – Jacob Nov 17 '22 at 17:40
  • 1
    Do you already know which rows you want to subset / is there a clear way of identifying which ones? Or alternatively, if the pattern varies a bit, do you have a set of formats you'd want to look for rather than just the 'JMs # 1111' one that you've got in your example? I think potentially updating your example to reflect how these patterns can vary could be helpful – henryn Nov 17 '22 at 17:42