I'm preparing a dataset for analysis in RStudio. The code below duplicates the data so that every variable column gets a new column next to it, where I put the cleaned values. This works well when the original column holds a single number that I then translate into a written answer, but I don't know how to pull a number out of an answer that also contains text. In the code below, column v1 holds sentences with numbers in them. My mutate call doesn't transfer anything because the values are treated as text. Is there a way to take the number out of each value and put it into the new column? My goal is for column v1.1 to contain 11 and 22 rather than the whole sentences from column v1.

library(tidyr)
library(dplyr)

df <- data.frame(v1=c("11 because of reason x","22 but I like this"),
              pages=c(32,45),
              name=c("spark","python"))
df

df2 <- cbind(df, df)
df2 <- df2[, sort(names(df2))]
df2[, seq(2, 6,by=2)] <- NA
names(df2) <- sub("\\.", ".", names(df2))

df2 <- df2 %>%
  mutate(v1.1 = ifelse( (v1 == 11)|(v1 == 22), v1, v1.1))

I'm hoping to make it so that I can use the mutate function from above and include some kind of stipulation to identify if a number is present at all in a cell even if it has text with it and to only put the number in the next corresponding column. I found this code below to separate numbers from text but it really didn't work for me. I could make it work if I can somehow include it under the mutate function.

df2 <- df1 %>%
  separate(v1, 
          into = c("text", "num"), 
          sep = "(?<=[A-Za-z])(?=[0-9])"
          )

When I used the above code to separate numbers and texts it also didn't work because the number values are in a sentence with parentheses and stuff and it seems like the code above only works for stuff like "AB55". I need a way to separate something like "(5+6)I think" into just a "5" or a "6". Is that at all possible? Thank you! I hope you all have a great day!

  • Could you edit your question to make it reproducible? Some tips to do that are [here](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). Side note, you can likely shorten your `ifelse` logic by saying `v1 %in% 1:8` – jpsmith Jul 28 '23 at 13:01
  • It's easier to help you if you include a simple [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input and desired output that can be used to test and verify possible solutions. – MrFlick Jul 28 '23 at 13:06
  • Hi! Thank you for the input, I'm going to make my question reproducible in a bit, just have to get something else done first. Thank you! – Miguel Von Fedak Jul 28 '23 at 13:23
  • Hi, I just edited my question to be reproducible. Thank you @jpsmith – Miguel Von Fedak Jul 29 '23 at 12:37
  • Hi Miguel, welcome to StackOverflow! – Mark Jul 30 '23 at 06:45
  • Can you clarify in the example you gave "(5+6)I think", would you want 5, 6, 5 AND 6, or something else? – Mark Jul 30 '23 at 06:46
  • @Mark 5 AND 6 would be preferable, thank you for asking :) – Miguel Von Fedak Jul 31 '23 at 09:54
  • @MiguelVonFedak updated! :-) – Mark Jul 31 '23 at 10:05

1 Answer

Here's one approach:

library(tidyverse)
df2 |> 
  mutate(v1.1 = map(v1, ~ str_extract_all(.x, "\\d+", simplify = TRUE) |> as.numeric()))

# Output
    name name.1 pages pages.1                     v1 v1.1
1  spark     NA    32      NA 11 because of reason x   11
2 python     NA    45      NA     22 but I like this   22
3      R     NA    58      NA           (5+6)I think 5, 6

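This gets every number from the strings and stores them in a list column, so a row can hold more than one value. If you'd prefer not to use stringr, base R can do the same job with regmatches() combined with gregexpr(); grep on its own only reports which elements match and does not return the matched text.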
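For anyone who wants to avoid stringr, here is a minimal base R sketch of the same extraction. It assumes the three-row input used in this answer; the column name `v1.first` and the variable `all_nums` are made up for illustration and are not part of the original question.

```r
# Same three-row input as in this answer
df <- data.frame(v1 = c("11 because of reason x", "22 but I like this", "(5+6)I think"),
                 pages = c(32, 45, 58),
                 name = c("spark", "python", "R"))

# gregexpr() locates every run of digits in each string;
# regmatches() pulls the matched text out, and as.numeric() converts it
all_nums <- lapply(regmatches(df$v1, gregexpr("[0-9]+", df$v1)), as.numeric)

# Keep every number as a list column (mirrors the stringr version above)
df$v1.1 <- all_nums

# Hypothetical extra column: only the first number per string, NA if there is none
df$v1.first <- vapply(all_nums,
                      function(x) if (length(x)) x[1] else NA_real_,
                      numeric(1))

df
```

The list column keeps both 5 and 6 for the third row, while `v1.first` is a plain numeric vector that may be easier to recode afterwards. The same `lapply(regmatches(...))` expression should also work inside `mutate()` if you want to stay in a dplyr pipeline.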

# Input:
df <- data.frame(v1=c("11 because of reason x","22 but I like this", "(5+6)I think"),
              pages=c(32,45, 58),
              name=c("spark","python", "R"))
Mark