I have a data frame with ~300 observations, each associated with a numeric code that I want to split into its component digits. The code variable is either a 3- or 4-digit integer, and the digits should be aligned by the last digit, so my desired output would look something like this:
code   d4 d3 d2 d1
 403 <NA>  4  0  3
5123    5  1  2  3
 105 <NA>  1  0  5
While I can see lots of ways to divide the code using strsplit (base R) or stringr::str_split, I am having difficulty applying any of these operations to my data frame.
library(stringr)
as.integer(unlist(str_split(5123, ""))[1]) # returns 5, the first digit - correct
as.integer(rev(unlist(str_split(5123, "")))[1]) # returns 3, the last digit - correct
But the plausible (to me) operation
library(dplyr)
df <- data.frame(code = c(403, 5123, 105))
df <- df %>%
  mutate(
    last = as.integer(rev(unlist(str_split(df$code, "")))[4])
  )
returns
> df
code last
1 403 3
2 5123 3
3 105 3
Clearly my understanding of how operations on lists and atomic vectors are handled within data frames is lacking...
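A minimal sketch of what seems to be happening here, using base strsplit() in place of str_split() (the two behave the same way for this input). The split is vectorised and returns a list with one character vector per code; unlist() then pools the digits of all three codes, so the index picks a single value that mutate() recycles down the column. The names pooled, fourth_from_end and last_digit are only for illustration, and vapply() is just one of several ways to work on each code separately.

```r
codes <- c(403, 5123, 105)

# strsplit() is vectorised: it returns a list with one character vector per code
digits <- strsplit(as.character(codes), "")

# unlist() pools the digits of all three codes into one vector of length 10,
# so indexing after rev() selects a single value from the pooled digits
pooled <- rev(unlist(digits))
fourth_from_end <- pooled[4]

# Working on each list element separately keeps the digits of each code apart
last_digit <- vapply(digits, function(d) as.integer(d[length(d)]), integer(1))
last_digit
```

Inside mutate(), a length-one result like fourth_from_end is repeated for every row, which would explain the identical values in the last column.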
I then felt sure that either the separate() or extract() functions from the tidyr package would help. Certainly, tidyr::separate() produces the desired result if the codes are supplied as strings with the digits separated by spaces (and a leading space for the 3-digit codes):
library(tidyr)
dfsep <- data.frame(code = c(" 4 0 3", "5 1 2 3", " 1 0 5"))
dfsep <- dfsep %>%
  separate(
    code, c("d4", "d3", "d2", "d1"), fill = "right", remove = FALSE
  )
dfsep
     code d4 d3 d2 d1
1   4 0 3     4  0  3
2 5 1 2 3  5  1  2  3
3   1 0 5     1  0  5
But a continuous string of digits cannot be split in this way, and empty search patterns are not supported by tidyr::separate():
df <- data.frame(code = c(403, 5123, 105))
df <- df %>%
  separate(
    code, c("d4", "d3", "d2", "d1"), fill = "right", remove = FALSE
  )
df
  code   d4   d3   d2   d1
1  403  403 <NA> <NA> <NA>
2 5123 5123 <NA> <NA> <NA>
3  105  105 <NA> <NA> <NA>
The problem with tidyr::extract() is that, although it extracts the digits beautifully, I have not been able to find a set of arguments that handles both 3- and 4-digit integers:
dfext <- data.frame(code = c(403, 5123, 105))
dfext <- dfext %>%
  extract(
    code, c("d4", "d3", "d2", "d1"), "(.)(.)(.)(.)", remove = FALSE
  )
dfext
  code   d4   d3   d2   d1
1  403 <NA> <NA> <NA> <NA>
2 5123    5    1    2    3
3  105 <NA> <NA> <NA> <NA>
Perhaps I have not understood how to construct the correct regex code for my purpose...
I have looked at related questions on StackOverflow including this one about separate() and this one about extract(), but I could not see how to apply the answers to my own problem. The question here gives a solution for a variable with values of fixed length, not variable.
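One way the fixed-length idea might carry over, sketched in base R under the assumption that every code is a positive integer of at most 4 digits: left-pad each code with spaces to a common width, then read off one character per position and treat the padding as missing. The helper digit_at() is hypothetical and only for illustration; this is not taken from any of the linked answers.

```r
df <- data.frame(code = c(403, 5123, 105))

# Left-pad every code with spaces to a common width of 4 characters,
# e.g. 403 becomes " 403" (assumes codes have at most 4 digits)
padded <- sprintf("%4d", as.integer(df$code))

# Hypothetical helper: the digit at a fixed position, with padding mapped to NA
digit_at <- function(pos) {
  ch <- substr(padded, pos, pos)
  as.integer(ifelse(ch == " ", NA, ch))
}

df$d4 <- digit_at(1)
df$d3 <- digit_at(2)
df$d2 <- digit_at(3)
df$d1 <- digit_at(4)
df
```

Because the padding goes on the left, the digits stay aligned by the last digit, so d1 is always the final digit of the code and d4 is NA for the 3-digit codes.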
Any help, tips or observations would be much appreciated!
P.S. To give context, this is a data frame of dives in a diving competition. Every row represents one dive, a single observation with multiple grouping variables: name, age, sex, dive number (e.g. 1 of 5), board height, dive code, dive position, tariff, J1 award, J2 award, ... J5 award, total award (dropping highest & lowest awards), & score (total award multiplied by tariff). The codes are determined by FINA