4

I have a data frame with ~300 observations, each associated with a numeric code that I want to split into its component digits. The code variable is either a 3 or 4 digit integer, aligned by its last digit, & so my desired output would look something like this:

code    d4 d3 d2 d1
 403  <NA>  4  0  3 
5123     5  1  2  3
 105  <NA>  1  0  5    

While I can see lots of ways to divide the code using strsplit (base R) or stringr::str_split, I am having difficulty applying any of these operations to my data frame.

library(stringr)
as.integer(unlist(str_split(5123, ""))[1]) # returns 5, the first digit - correct
as.integer(rev(unlist(str_split(5123, "")))[1]) # returns 3, the last digit - correct

But the plausible (to me) operation

libray(dplyr)
df <- data.frame(code = c(403, 5123, 105))
df <- df %>% 
  mutate(
    last = as.integer(rev(unlist(str_split(df$code,"")))[4])
  )

returns

> df
  code last
1  403    3
2 5123    3
3  105    3

Clearly my understanding of how operations on lists and atomic vectors are handled within data frames is lacking...

I then felt sure that either the separate() or extract() functions from the tidyr package would help. Certainly, tidyr::separate() produces the desired result if the codes are supplied as strings with a leading space before each digit:

library(tidyr)
dfsep <- data.frame(code = c(" 4 0 3", "5 1 2 3", " 1 0 5"))
dfsep <- dfsep %>% 
  separate(
    code, c("d4", "d3", "d2", "d1"), fill =  "right", remove = FALSE
    )

dfsep
     code d4 d3 d2 d1
1   4 0 3     4  0  3
2 5 1 2 3  5  1  2  3
3   1 0 5     1  0  5

But a continuous string of digits cannot be split in this way; and empty search patterns are not supported by tidyr::separate()

df <- data.frame(code = c(403, 5123, 105))
df <- df %>% 
  separate(
    code, c("d4", "d3", "d2", "d1"), fill =  "right", remove = FALSE
  )

df
  code   d4   d3   d2   d1
1  403  403 <NA> <NA> <NA>
2 5123 5123 <NA> <NA> <NA>
3  105  105 <NA> <NA> <NA>

While the problem with tidyr::extract() is that although it extracts the digits beautifully I have not been able to find a set of arguments that handles both 3 & 4 digit integers:

dfext <- data.frame(code = c(403, 5123, 105))
dfext <- dfext %>% 
  extract(
    code, c("d4", "d3", "d2", "d1"), "(.)(.)(.)(.)", remove = FALSE
    ) 

dfext
  code   d4   d3   d2   d1
1  403 <NA> <NA> <NA> <NA>
2 5123    5    1    2    3
3  105 <NA> <NA> <NA> <NA>

Perhaps I have not understood how to construct the correct regex code for my purpose...

I have looked at related questions on StackOverflow including this one about separate() and this one about extract(), but I could not see how to apply the answers to my own problem. The question here gives a solution for a variable with values of fixed length, not variable.

Any help, tips or observations would be much appreciated!

P.S. To give context, this is a data frame of dives in a diving competition. Every row represents one dive, a single observation with multiple grouping variables: name, age, sex, dive number (e.g. 1 of 5), board height, dive code, dive position, tariff, J1 award, J2 award, ... J5 award, total award (dropping highest & lowest awards), & score (total award multiplied by tariff). The codes are determined by FINA

Community
  • 1
  • 1

4 Answers4

2

The regex should be "(.)?(.)(.)(.)"

? to be used for item occurs zero or one time

dfext %>% extract(code, c('d1','d2','d3','d4'), "(.)?(.)(.)(.)")
d1 d2 d3 d4
1 <NA>  4  0  3
2    5  1  2  3
3 <NA>  1  0  5
Alfred Tso
  • 51
  • 4
1

We can use stri_list2matrix from stringi after splitting with strsplit

n <- max(nchar(df$code)) #get the maximum number of characters
fmt <- paste0('%', n, 'd') #create a format for the `sprintf`
library(stringi)
#the list output from `strsplit` can be coerced to `matrix` using
#stri_list2matrix.
d1 <- stri_list2matrix(strsplit(sprintf( fmt, df$code), ''), byrow=TRUE)
#But, the output is character class, which we can convert to 'numeric' 
m1 <- matrix(as.numeric(d1), ncol=ncol(d1), nrow=nrow(d1))
m1
#     [,1] [,2] [,3] [,4]
#[1,]   NA    4    0    3
#[2,]    5    1    2    3
#[3,]   NA    1    0    5

For the 'dfsep' dataset

v1 <- gsub('\\s+', '', dfsep$code)
n <- max(nchar(v1))
fmt <- paste0('%', n, 's')
d1  <- stri_list2matrix(strsplit(sprintf(fmt, v1), ''), byrow=TRUE)
m1 <- matrix(as.numeric(d1), ncol=ncol(d1), nrow=nrow(d1))
m1
#     [,1] [,2] [,3] [,4]
#[1,]   NA    4    0    3
#[2,]    5    1    2    3
#[3,]   NA    1    0    5

We can cbind with the original dataset

cbind(dfsep, m1)

This can be made into a function for applying to different datasets.

akrun
  • 874,273
  • 37
  • 540
  • 662
  • 1
    Thank you @akrun! For the basic data frame df the code `df <- data.frame(code = c(403, 5123, 105)); df <- cbind(df, data.frame(stri_list2matrix(strsplit(sprintf('%04d', df$code), ''), byrow=TRUE)))` does exactly what I was hoping for. – Andrew Graham Oct 30 '15 at 15:46
  • I have ticked your solution, thank you. I will now go away & read up on the stringi package – Andrew Graham Oct 30 '15 at 15:56
1

Only tested a few cases, but this should also work for different types of input

f <- function(df) {
  f <- tempfile()
  df$ccode <- gsub('\\s+', '', df$code)
  cat(file = f, sprintf('%4s', gsub('\\s+', '', df$ccode)), sep = "\n")
  cbind(code = df$code, read.fwf(f, widths = rep(1, max(nchar(df$ccode)))))
}

df <- data.frame(code = c(403, 5123, 105))
f(df)
#   code V1 V2 V3 V4
# 1  403 NA  4  0  3
# 2 5123  5  1  2  3
# 3  105 NA  1  0  5

dfsep <- data.frame(code = c(" 4 0 3", "5 1 2 3", " 1 0 5"))
f(dfsep)
#      code V1 V2 V3 V4
# 1   4 0 3 NA  4  0  3
# 2 5 1 2 3  5  1  2  3
# 3   1 0 5 NA  1  0  5
rawr
  • 20,481
  • 4
  • 44
  • 78
  • Thank you @rawr - this is very interesting! I am impressed that your code can deal with arbitrarily padded input values. For my purposes, where I know I am only expecting a 3 or 4 digit value, the solution by @akrun above suffices. But your code is all in base R and I must read up on the `sprintf` function, which is evidently powerful... – Andrew Graham Oct 30 '15 at 16:06
0

A simple base R solution

codes = c(403, 5123, 105)

# make all codes the same length
l = sapply(codes, nchar)
s = strrep(' ', max(l) - l)
new_codes = paste0(s, codes)

# split and combine into matrix
res = do.call(rbind, strsplit(new_codes, ''))

Reformat as needed:

res = data.frame(code=codes, res)
colnames(res) = c('code', 'd4', 'd3', 'd2', 'd1')

Output:

  code d4 d3 d2 d1
1  403     4  0  3
2 5123  5  1  2  3
3  105     1  0  5
adn bps
  • 599
  • 4
  • 16