1

I have a dataframe that looks like the below:

BaseRating    contRating Participant
5,4,6,3,2,4       5        01       
4                 4        01   

I would first like to run some code that looks to see whether there are any commas in the dataframe, and returns a column number of where that is. I have tried some of the solutions in the questions below, which don't seem to work when looking for a comma instead of a string/whole value? I'm probably missing something simple here but any help appreciated!

Selecting data frame rows based on partial string match in a column

Filter rows which contain a certain string

Check if value is in data frame

Having determined whether there are commas in my data, I then want to extract just the last number in the list separated by commas in that entry, and replace the entry with that value. For instance, I want the first row in the BaseRating column to become '4' because it is last in that list.

Is there a way to do this in R without manually changing the number?

2 Answers2

1

A possible solution is bellow.

EXPLANATION

In what follows, I will explain the regex expression used in str_extract function, as asked for by @milsandhills:

  1. The symbol | in the middle means the logical OR operator.

  2. We use that because BaseRating can have multiple numbers or only one number — hence the need to use |, to treat each case separately.

  3. The left-hand side of | means a number formed by one or more digits (\\d+), which starts (^) and finishes the string ($).

  4. The right-hand side of | means a number formed by one or more digits (\\d+), which finishes the string ($). And (?<=\\,) is used to guarantee that the number is preceded by a comma.

You can find more details at the stringr cheat sheet.

library(tidyverse)

df <- data.frame(
  BaseRating = c("5,4,6,3,2,4", "4"),
  contRating = c(5L, 4L),
  Participant = c(1L, 1L)
)

df %>% 
  mutate(BaseRating = sapply(BaseRating, 
         function(x) str_extract(x, "^\\d+$|(?<=\\,)\\d+$") %>% as.integer))

#>   BaseRating contRating Participant
#> 1          4          5           1
#> 2          4          4           1

Or:

library(tidyverse)

df %>% 
  separate_rows(BaseRating, sep = ",", convert = TRUE) %>% 
  group_by(contRating, Participant) %>% 
  summarise(BaseRating = last(BaseRating), .groups = "drop") %>% 
  relocate(BaseRating, .before = 1)

#> # A tibble: 2 × 3
#>   BaseRating contRating Participant
#>        <int>      <int>       <int>
#> 1          4          4           1
#> 2          4          5           1
PaulS
  • 21,159
  • 2
  • 9
  • 26
  • 1
    Thanks, this is really helpful! My actual data frame has a lot more columns than in my user-friendly example, so I wondered if there is a quicker way than grouping by all my existing columns? – milsandhills Jan 24 '22 at 16:21
  • Maybe the new solution I just added to my answer is what you are looking for, @milsandhills. – PaulS Jan 24 '22 at 16:26
  • Your second option works wonderfully. Could you perhaps explain a bit more about what the code within the str_extract function is actually doing, as I'm a bit puzzled by all the symbols? – milsandhills Jan 25 '22 at 15:40
  • I have just added an explanation to my answer, @milsandhills. Hope that helps you! If you have any further question, please do not hesitate and ask me. – PaulS Jan 25 '22 at 17:16
  • 1
    Thanks so much, that's great! – milsandhills Jan 27 '22 at 11:39
0

If we want a quick option, we can use trimws from base R

df$BaseRating <- as.numeric(trimws(df$BaseRating, whitespace = ".*,"))

-output

> df
  BaseRating contRating Participant
1          4          5           1
2          4          4           1

Or another option is stri_extract_last

library(stringi)
df$BaseRating <- as.numeric(stri_extract_last_regex(df$BaseRating, "\\d+"))

data

df <- structure(list(BaseRating = c("5,4,6,3,2,4", "4"), contRating = 5:4, 
    Participant = c(1L, 1L)), class = "data.frame", row.names = c(NA, 
-2L))
akrun
  • 874,273
  • 37
  • 540
  • 662