How to complete several character vector formatting steps in a single function?

Question

EDITED

I have a simple list of column names that I would like to change the format of, ideally programmatically. This is a sample of the list:

    vars_list <- c("tBodyAcc.mean...X", "tBodyAcc.mean...Y", "tBodyAcc.mean...Z",
    "tBodyAcc.std...X", "tBodyAcc.std...Y", "tBodyAcc.std...Z", 
    "tGravityAcc.mean...X", "tGravityAcc.mean...Y", "tGravityAcc.mean...Z",
    "tGravityAcc.std...X", "tGravityAcc.std...Y", "tGravityAcc.std...Z",
    "fBodyAcc.mean...X", "fBodyAcc.mean...Y", "fBodyAcc.mean...Z", 
    "fBodyAcc.std...X", "fBodyAcc.std...Y", "fBodyAcc.std...Z",
    "fBodyAccJerk.mean...X", "fBodyAccJerk.mean...Y", "fBodyAccJerk.mean...Z",
    "fBodyAccJerk.std...X", "fBodyAccJerk.std...Y", "fBodyAccJerk.std...Z")

And this is the result I'm hoping for:

 [3] "Time_Body_Acc_Mean_X"                "Time_Body_Acc_Mean_Y"
 [5] "Time_Body_Acc_Mean_Z"                "Time_Body_Acc_Stddev_X"             
 [7] "Time_Body_Acc_Stddev_Y"              "Time_Body_Acc_Stddev_Z"             
 [9] "Time_Gravity_Acc_Mean_X"             "Time_Gravity_Acc_Mean_Y"            
[11] "Time_Gravity_Acc_Mean_Z"             "Time_Gravity_Acc_Stddev_X"          
[13] "Time_Gravity_Acc_Stddev_Y"           "Time_Gravity_Acc_Stddev_Z"

...

[43] "Freq_Body_Acc_Mean_X"                "Freq_Body_Acc_Mean_Y"               
[45] "Freq_Body_Acc_Mean_Z"                "Freq_Body_Acc_Stddev_X"             
[47] "Freq_Body_Acc_Stddev_Y"              "Freq_Body_Acc_Stddev_Z"             
[49] "Freq_Body_Acc_Jerk_Mean_X"           "Freq_Body_Acc_Jerk_Mean_Y"          
[51] "Freq_Body_Acc_Jerk_Mean_Z"           "Freq_Body_Acc_Jerk_Stddev_X"        
[53] "Freq_Body_Acc_Jerk_Stddev_Y"         "Freq_Body_Acc_Jerk_Stddev_Z"

I've put together what feels like a really verbose way of making the changes employing regular expressions.

    vars_list <- unlist(lapply(vars_list, function(x){gsub("^t", "Time", x)}))
    vars_list <- unlist(lapply(vars_list, function(x){gsub("^f", "Freq", x)}))
    vars_list <- unlist(lapply(vars_list, function(x){gsub("std", "Stddev", x)}))
    vars_list <- unlist(lapply(vars_list, function(x){gsub("mean", "Mean", x)}))
    vars_list <- unlist(lapply(vars_list, function(x){gsub("\\.+", "", x)}))
    vars_list <- unlist(lapply(vars_list, function(x){gsub("\\.", "", x)}))
    vars_list <- unlist(lapply(vars_list,
                               function(x){gsub("(?<=[a-z]).{0}(?=[A-Z])",
                                                "_", x, perl = TRUE)}))

Is there a way to arrive at the same results more efficiently and elegantly by including two or more formatting steps in a single function call?

Well, `gsub` is vectorized, so you can get rid of `unlist(lapply(function(x)...))`, instead of `unlist(lapply(vars_list, function(x){gsub("^t", "Time", x)}))` use `gsub("^t", "Time", vars_list)`. Other than that, I think your code is fine. — Gregor Thomas, Dec 05 '17 at 04:11
Or just do it in a quick loop - https://stackoverflow.com/questions/26171318/regex-for-preserving-case-pattern-capitalization/26171700 — thelatemail, Dec 05 '17 at 04:32
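The two comments above can be combined into a short base R sketch: because `gsub` is vectorized over its input, a plain loop over pattern/replacement pairs replaces the whole `unlist(lapply(...))` chain. The names `vars`, `patterns`, `replacements` and `out` are only illustrative, and the two-element input is a cut-down sample rather than the full list.

```r
# cut-down sample of the column names from the question
vars <- c("tBodyAcc.mean...X", "fBodyAccJerk.std...Z")

# each pattern is paired with the replacement at the same position
patterns     <- c("^t", "^f", "std", "mean", "\\.+", "(?<=[a-z])(?=[A-Z])")
replacements <- c("Time", "Freq", "Stddev", "Mean", "", "_")

out <- vars
for (i in seq_along(patterns)) {
  # gsub works on the whole vector at once, so no lapply is needed;
  # perl = TRUE is required for the lookbehind/lookahead in the last pattern
  out <- gsub(patterns[i], replacements[i], out, perl = TRUE)
}
out
# [1] "Time_Body_Acc_Mean_X"        "Freq_Body_Acc_Jerk_Stddev_Z"
```

The order of the pairs matters: the underscore step has to run last, after the dots are removed and the prefixes are expanded.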

score 3 · Accepted Answer · answered Dec 05 '17 at 04:24

One alternative is to put the patterns and their replacements in two parallel vectors and pass both to `stringi::stri_replace_all_regex`. With `vectorize_all = FALSE`, every pattern is applied in turn to each element of the input:

    # each pattern corresponds to the replacement at the same position
    patterns <- c('^t', '^f', 'std', 'mean', '\\.+', '(?<=[a-z])([A-Z])')
    replacement <- c('Time', 'Freq', 'Stddev', 'Mean', '', '_$1')

    library(stringi)
    stri_replace_all_regex(vars_list, patterns, replacement, vectorize_all = FALSE)
    # [1] "Time_Body_Acc_Mean_X"      "Time_Body_Acc_Mean_Y"     
    # [3] "Time_Body_Acc_Mean_Z"      "Time_Body_Acc_Stddev_X"   
    # [5] "Time_Body_Acc_Stddev_Y"    "Time_Body_Acc_Stddev_Z"   
    # [7] "Time_Gravity_Acc_Mean_X"   "Time_Gravity_Acc_Mean_Y"  
    # [9] "Time_Gravity_Acc_Mean_Z"   "Time_Gravity_Acc_Stddev_X"
    #[11] "Time_Gravity_Acc_Stddev_Y" "Time_Gravity_Acc_Stddev_Z"


Psidom


Can you explain this string in `replacement`: '_$1'? – Conner M. Dec 05 '17 at 05:16
`$1` is a back reference in ICU's regular expression syntax, which the `stringi` package uses. It replaces `$1` in `_$1` with whatever the group in the pattern `(?<=[a-z])([A-Z])` captured, i.e. an uppercase letter that follows a lowercase letter. – Psidom Dec 05 '17 at 15:31
I see. I think the pattern needs to be modified to `(?<=[a-z])(?=[A-Z])`, since the idea is to insert between the lowercase and uppercase letters and not replace the uppercase letter. – Conner M. Dec 05 '17 at 16:17
Yep, that works as well and seems simpler. In that case you don't need `$1` in the replacement anymore; a plain `_` should work. – Psidom Dec 05 '17 at 16:20
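The two variants discussed in these comments can be compared directly. The sketch below uses base R's `gsub(..., perl = TRUE)` instead of `stringi`, so the back reference is written `\\1` rather than ICU's `$1`; that substitution is an assumption made to keep the example free of extra packages. The input strings are hypothetical intermediate values, i.e. names after the earlier replacement steps.

```r
# hypothetical intermediate names, after prefixes and dots are handled
x <- c("TimeBodyAccMeanX", "FreqBodyAccJerkStddevZ")

# variant 1: capture the uppercase letter and put it back after the underscore
with_capture <- gsub("(?<=[a-z])([A-Z])", "_\\1", x, perl = TRUE)

# variant 2: match only the empty position between the two letters
with_lookahead <- gsub("(?<=[a-z])(?=[A-Z])", "_", x, perl = TRUE)

# both insert an underscore and leave every letter in place
identical(with_capture, with_lookahead)
```

Since the captured letter is written back in variant 1, neither pattern drops any characters; the lookahead version just avoids the capture group.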

Maurits Evers · Answer 2 · 2017-12-05T04:51:44.377

0

How about this using base R's sub?

    sub("t(\\w+)(Acc)\\.(\\w+)\\.+([XYZ])", "Time_\\1_\\2_\\3_\\4", vars_list)
    # [1] "Time_Body_Acc_mean_X"    "Time_Body_Acc_mean_Y"
    # [3] "Time_Body_Acc_mean_Z"    "Time_Body_Acc_std_X"
    # [5] "Time_Body_Acc_std_Y"     "Time_Body_Acc_std_Z"
    # [7] "Time_Gravity_Acc_mean_X" "Time_Gravity_Acc_mean_Y"
    # [9] "Time_Gravity_Acc_mean_Z" "Time_Gravity_Acc_std_X"
    #[11] "Time_Gravity_Acc_std_Y"  "Time_Gravity_Acc_std_Z"

Changing `mean` to `Mean` and `std` to `Stddev` requires two additional `sub` calls. Mapping the `f` prefix to `Freq` requires another call, analogous to the `t`-to-`Time` one above.
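As a rough sketch of what those extra calls could look like in base R only: the `^` and `$` anchors and the names `vars` and `out` are additions for this illustration, not part of the answer's original pattern. Note that a name such as `fBodyAccJerk.mean...X` would not match, because the pattern expects the dot straight after `Acc`.

```r
# small sample; the "f" name is included to show the extra call it needs
vars <- c("tBodyAcc.mean...X", "tGravityAcc.std...Z", "fBodyAcc.std...Y")

# one call per prefix: rearrange the captured pieces around underscores
out <- sub("^t(\\w+)(Acc)\\.(\\w+)\\.+([XYZ])$", "Time_\\1_\\2_\\3_\\4", vars)
out <- sub("^f(\\w+)(Acc)\\.(\\w+)\\.+([XYZ])$", "Freq_\\1_\\2_\\3_\\4", out)

# two more calls to fix the capitalisation of the statistic
out <- sub("_mean_", "_Mean_", out, fixed = TRUE)
out <- sub("_std_", "_Stddev_", out, fixed = TRUE)
out
# [1] "Time_Body_Acc_Mean_X"      "Time_Gravity_Acc_Stddev_Z"
# [3] "Freq_Body_Acc_Stddev_Y"
```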

edited Dec 05 '17 at 04:51

answered Dec 05 '17 at 04:24


`^f` for `Freq` is missing too. – wp78de Dec 05 '17 at 04:38
@wp78de There is no `Freq` in OP's minimal example string. – Maurits Evers Dec 05 '17 at 04:39
Look closer at the OP's list of the gsub()s. Psidom's solution includes it. – wp78de Dec 05 '17 at 04:44
@wp78de Again, there is no `Freq` in OP's example string (look even closer;-). At best, I agree that the example is *not representative*, based on OP's code attempt. Either way, it doesn't change anything. Including `Freq` requires another couple of `sub`s. But I agree, Psidom's solution is neater, albeit at the expense of an additional library. My solution is base R only. – Maurits Evers Dec 05 '17 at 04:50
I left out examples that begin with "f". An oversight. – Conner M. Dec 05 '17 at 04:58
No worries @ConnerM. I'd recommend taking a look at Psidom's solution, which is very concise and elegant. – Maurits Evers Dec 05 '17 at 04:58
Regardless of its applicability in this particular case @MauritsEvers, your response has given me some insight into the functionality of `sub` that I didn't possess previously. – Conner M. Dec 05 '17 at 05:10
