18

I often get datasets from collaborators in which the variables/columns are named inconsistently. One of my first tasks is to rename them, and I want a solution that works entirely within R.

as.Given <- c("ICUDays","SexCode","MAX_of_MLD","Age.Group")

underscore_lowercase <- c("icu_days", "sex_code", "max_of_mld","age_group")

camelCase <- c("icuDays", "sexCode", "maxOfMld", "ageGroup")

Given the different opinions about naming conventions and in the spirit of what was proposed in Python, what ways are there to go from as.Given to underscore_lowercase and/or camelCase in a user-specified way in R?

Edit: Also found this related post in R / regex, especially the answer of @rengis.

swihart
  • So, where are you stuck? The most difficult regex is already given in the Python solution. – Roland Aug 26 '14 at 11:10
  • I would solve this by deciding that my convention will be all lowercase with no underscores or period. It is much easier, and you don't have to worry about getting input data like *icudays*, which would be next to impossible to convert to one of those formats programmatically. –  Aug 26 '14 at 11:11
  • @Roland turning the regex into a function in R. I am not sure how to translate `s1 = re.sub('(.)([A-Z][a-z]+)', r'\1_\2', name)` and `re.sub('([a-z0-9])([A-Z])', r'\1_\2', s1).lower()` into R statements. – swihart Aug 26 '14 at 11:40
  • @dan1111 I agree that the ICUDays is an especially tricky case, that's why I included it. :-) My initial thought was to build in functionality that identified runs of consecutive capital letters, and made all but the last one in the consecutive run lower case -- or something to this effect. I do appreciate your opinion on naming and somewhat agree it is simpler but was hoping to attempt to accommodate a more readable solution in the spirit of the link posted. – swihart Aug 26 '14 at 11:45
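
One way to carry the two Python `re.sub` calls quoted in the comments over to R is sketched below. The function name `camel_to_snake` is made up for illustration, and the sketch assumes plain camel-case input: it does not treat `.` or existing underscores as separators.

```r
# Hypothetical helper: a direct port of the two Python re.sub() calls.
camel_to_snake <- function(name) {
  # re.sub('(.)([A-Z][a-z]+)', r'\1_\2', name)
  s1 <- gsub("(.)([A-Z][a-z]+)", "\\1_\\2", name, perl = TRUE)
  # re.sub('([a-z0-9])([A-Z])', r'\1_\2', s1).lower()
  tolower(gsub("([a-z0-9])([A-Z])", "\\1_\\2", s1, perl = TRUE))
}

camel_to_snake(c("ICUDays", "SexCode", "admitROM"))
## [1] "icu_days"  "sex_code"  "admit_rom"
```

In R, `gsub()` plays the role of `re.sub()`, backreferences are written `\\1` and `\\2` inside the string, and `tolower()` replaces `.lower()`.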

4 Answers

10

Try this. These at least work on the examples given:

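toUnderscore <- function(x) {
  # underscore before a capital that starts a word: "ICUDays" -> "ICU_Days"
  x2 <- gsub("([A-Za-z])([A-Z])([a-z])", "\\1_\\2\\3", x)
  # periods become underscores: "Age.Group" -> "Age_Group"
  x3 <- gsub(".", "_", x2, fixed = TRUE)
  # underscore between a lower-case letter and a capital: "admitROM" -> "admit_ROM"
  x4 <- gsub("([a-z])([A-Z])", "\\1_\\2", x3)
  x5 <- tolower(x4)
  x5
}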

underscore2camel <- function(x) {
  gsub("_(.)", "\\U\\1", x, perl = TRUE)
}

#######################################################
# test
#######################################################

u <- toUnderscore(as.Given)
u
## [1] "icu_days"   "sex_code"   "max_of_mld" "age_group" 

underscore2camel(u)
## [1] "icuDays"  "sexCode"  "maxOfMld" "ageGroup"
G. Grothendieck
  • `gsub("_(.)", "\\U\\1", x, perl = TRUE)` should be changed to `gsub("_([a-z])", "\\U\\1", x, perl = TRUE)`, because only lower-case letters need to be converted to upper case. – Avinash Raj Aug 26 '14 at 12:08
  • As [suggested](http://stackoverflow.com/questions/25504222/elegant-r-function-mixed-case-separated-by-periods-to-underscore-separated-lowe/25504488?noredirect=1#comment39814334_25504488) by swihart, your code won't work if the `as.Given` contains `"admitROM"` – Avinash Raj Aug 26 '14 at 13:13
  • Have now added code (the `x4` line) to handle that case. – G. Grothendieck Aug 26 '14 at 13:32
  • To add to the wishlist: how can we handle `as.Given = c("CRMLevel1Code", "MAX_of_RhD", "MAX_Of_MCa", "MAX_of_NCCexclusion")` to yield `underscore_lowercase = c("crm_level_1_code", "max_of_rhd", "max_of_mca", "max_of_ncc_exclusion")` ? I fear the last element is too difficult to achieve while maintaining current functionality because `ICUDays` has all caps for the acronym word `ICU` followed by capitalized word `Days` and `NCCexclusion` has all caps for the acronym word `NCC` and a lowercase word `exclusion` immediately following. Should I edit the question or start new post? Thanks. – swihart Aug 26 '14 at 15:36
  • ICUDays -> icu_days and NCCexclusion -> ncc_exclusion seem to use different rules for the same situation. Do you really want to separate out numbers? CRMLevel1Code -> crm_level1code might actually be preferable. The problem is coming up with a definitive specification of what is desired that is not inherently ambiguous. – G. Grothendieck Aug 26 '14 at 15:55
  • @swihart Updated my answer according to your new input. May I know what the camelCase string is for the lowercase string `crm_level_1_code`? – Avinash Raj Aug 27 '14 at 02:19
  • I suppose camelcase for `CRMLevel1Code` could be `crmLevel1Code`. This question and the fantastic development in answering it has revealed a couple of things: just how thorny / non-standard the input can be with respect to rules and how to treat acronym/abbreviations/strings of capital letters. – swihart Aug 27 '14 at 10:28
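
The ambiguity discussed in these comments (`ICUDays` versus `NCCexclusion`) can be sidestepped if the user supplies the acronyms explicitly. The following is only a sketch under that assumption: the function name `to_snake_with_acronyms` and its `acronyms` argument are invented here, and the acronyms are assumed to be plain letters with no regex metacharacters.

```r
# Hypothetical helper: snake_case conversion with a user-supplied acronym list.
to_snake_with_acronyms <- function(x, acronyms = character(0)) {
  # Split each known acronym from the letters or digits that follow it,
  # e.g. "NCCexclusion" -> "NCC_exclusion" when "NCC" is listed.
  for (a in acronyms) {
    x <- gsub(paste0("(", a, ")([A-Za-z0-9])"), "\\1_\\2", x)
  }
  x <- gsub(".", "_", x, fixed = TRUE)            # periods act as separators
  x <- gsub("([A-Z])([A-Z][a-z])", "\\1_\\2", x)  # "ICUDays" -> "ICU_Days"
  x <- gsub("([a-z0-9])([A-Z])", "\\1_\\2", x)    # "admitROM" -> "admit_ROM"
  tolower(x)
}

to_snake_with_acronyms(c("ICUDays", "MAX_of_NCCexclusion"),
                       acronyms = c("ICU", "NCC"))
## [1] "icu_days"             "max_of_ncc_exclusion"
```

Without the acronym list the generic rules split `NCCexclusion` as `nc_cexclusion`, which is the rule conflict pointed out above.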
5

To get the underscore_lowercase (g) and camelCase (x) strings:

> as.Given <- c("ICUDays","SexCode","MAX_of_MLD","Age.Group")
> r <- gsub("[^\\w]", "", as.Given, perl=T)
> f <- gsub("^.*?_.*$(*SKIP)(*F)|(?:[^A-Z]+|[A-Z_]+?)\\K([A-Z])(?=[A-Z_]+$|[a-z_]+$)", "_\\1", r,perl=T)
> g <- tolower(f)
> g
[1] "icu_days"   "sex_code"   "max_of_mld" "age_group"
> x <- gsub("_([a-z])", "\\U\\1", g,perl=T)
> x
[1] "icuDays"  "sexCode"  "maxOfMld" "ageGroup"

UPDATE

> as.Given = c("CRMLevel1Code", "MAX_of_RhD", "MAX_Of_MCa", "MAX_of_NCCexclusion","ICUDays","SexCode","MAX_of_MLD","Age.Group","admitRom")
> r <- gsub("[^\\w]", "", as.Given, perl=T)
> f <- gsub("(?:[^A-Z]|^)[A-Z][A-Z][A-Z]\\K(?=[a-zA-Z])|(?=\\d)|^[A-Z][a-z]+\\K(?=[A-Z][a-z]+$)|(?<=\\d)(?=[A-Za-z])|^[a-z]+\\K(?=[A-Z][a-z]+$)", "_", r, perl=T)
> underscore_lowercase <- tolower(f)
> underscore_lowercase
[1] "crm_level_1_code"     "max_of_rhd"           "max_of_mca"          
[4] "max_of_ncc_exclusion" "icu_days"             "sex_code"            
[7] "max_of_mld"           "age_group"            "admit_rom"           
> camelCase <- gsub("_([a-z]|\d)", "\\U\\1", underscore_lowercase, perl=T)
Error: '\d' is an unrecognized escape in character string starting ""_([a-z]|\d"
> camelCase <- gsub("_([a-z]|\\d)", "\\U\\1", underscore_lowercase, perl=T)
> camelCase
[1] "crmLevel1Code"     "maxOfRhd"          "maxOfMca"         
[4] "maxOfNccExclusion" "icuDays"           "sexCode"          
[7] "maxOfMld"          "ageGroup"          "admitRom" 
Avinash Raj
4

Based off your as.Given vector and adding admitROM to the list, this will do the trick.

as.Given <- c('ICUDays', 'SexCode', 'MAX_of_MLD', 'Age.Group', 'admitROM')
invertd <- gsub('(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])|\\.', '_', as.Given, perl=T)
toscore <- tolower(invertd)
## [1] "icu_days"   "sex_code"   "max_of_mld" "age_group"  "admit_rom" 
tocamel <- gsub("_([a-z])", "\\U\\1", toscore, perl=T)
## [1] "icuDays"  "sexCode"  "maxOfMld" "ageGroup" "admitRom"
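
Since the stated goal is renaming dataset columns in a user-specified style, here is a minimal sketch of how such rules might be applied to a data frame. It uses a capture-group variant of the splitting rules rather than the lookaround pattern above, and the names `to_snake`, `to_camel`, and `rename_columns` are hypothetical helpers introduced only for this example.

```r
# Hypothetical helpers; capture groups stand in for the lookarounds.
to_snake <- function(x) {
  x <- gsub(".", "_", x, fixed = TRUE)            # "Age.Group" -> "Age_Group"
  x <- gsub("([A-Z])([A-Z][a-z])", "\\1_\\2", x)  # "ICUDays"   -> "ICU_Days"
  x <- gsub("([a-z0-9])([A-Z])", "\\1_\\2", x)    # "admitROM"  -> "admit_ROM"
  tolower(x)
}

to_camel <- function(x) {
  # Upper-case the character after each underscore and drop the underscore.
  gsub("_([a-z0-9])", "\\U\\1", to_snake(x), perl = TRUE)
}

rename_columns <- function(df, style = c("snake", "camel")) {
  style <- match.arg(style)  # user-specified naming convention
  names(df) <- switch(style,
                      snake = to_snake(names(df)),
                      camel = to_camel(names(df)))
  df
}

df <- data.frame(ICUDays = 1:2, SexCode = c("M", "F"),
                 MAX_of_MLD = c(0.1, 0.2), Age.Group = c("a", "b"))
names(rename_columns(df, "camel"))
## [1] "icuDays"  "sexCode"  "maxOfMld" "ageGroup"
```

`match.arg()` makes `"snake"` the default and rejects any style that is not listed.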
hwnd
2

This should do the trick:

install.packages("snakecase")
library(snakecase)

to_snake_case(as.Given)
#> [1] "icu_days"   "sex_code"   "max_of_mld" "age_group" 

to_lower_camel_case(as.Given)
#> [1] "icuDays"  "sexCode"  "maxOfMld" "ageGroup"

GitHub link to the snakecase package: https://github.com/Tazinho/snakecase

Taz