0

I work with data sets from different sources. Often different sources will come with different names for the same variable. For example, from source A a variable will be called "Year" but from Source B it will be named "YEAR". Or "Revenue" in A and "TOT.REV" in B. I want to write a script that renames variables across all the different sources and gives them consistent names.

Right now I am doing this with dplyr::rename, but given the large number of variables it is pretty monstrous and unwieldly to manage. What I have now looks like this

rename(df, Year = YEAR, Month = MONTH, Revenue = TOT.REV, ...)

Except I loop it over a list containing all my data frames from various sources.

The problem is, if I ever want to change anything, I have to maintain this enormous list within the script. Not very convenient, and also very ugly. I would like to do it with a reference table that I could maintain in a spreadsheet somewhere. Do you know of any slick way of doing this? Thanks in advance. :)

sam
  • 67
  • 7
  • Functions that aren't exact mappings will be your friend - i.e., maybe your first step should be `tolower()` or something like that to get everything in the same case, maybe with `make.names()` if you have to worry about spaces or other nonstandard characters. This will make your code flexible so that `TOT.REV`, `TOT.rev` and `tot rev` don't each have to be explicitly coded. – Gregor Thomas Aug 30 '16 at 22:15
  • Personally I don't see how maintaining a data frame or named vector in R (and maybe an accompanying function) is any harder than maintaining a spreadsheet (and an accompanying function), but sure - you could use a spreadsheet. What's stopping you? Just do a left join or a match between your current data names and your reference names. – Gregor Thomas Aug 30 '16 at 22:16
  • "Just do a left join or a match between your current data names and your reference names." yes! that's exactly what I want! but, er... how do I do that? (sorry) :'( – sam Aug 30 '16 at 22:27
  • 1
    Where are you stuck? Have a look at the examples at the bottom of `?match` or `?merge` or `?dplyr::left_join`. I don't know how to help more with what you've given us... If you make a [reproducible example](http://stackoverflow.com/q/5963269/903061) - create a small lookup table and a couple sample data frames (they don't even need rows - just column names) and give it a go. You can probably figure it out. If you get stuck, post your example and what you tried and make sure you're clear about your expected output. – Gregor Thomas Aug 30 '16 at 22:35
  • Your question states you are using databases. Could you write custom views to rename the columns in the databases? – Benjamin Aug 30 '16 at 23:04
  • Alternatively, you could write a package to facilitate your imports. You could build your reference table in a CSV, then import the data and save it to `sysdata.rda` which will be accessible to your package code if saved in the `R` directory. – Benjamin Aug 30 '16 at 23:06
  • @Gregor Thanks, I figured out something that does the trick. It's not pretty but better than what I had. :) – sam Aug 31 '16 at 00:02

2 Answers2

1

You can use rename.vars from gdata:

fromnames <- c("speed", "dist")
tonames <- c("Velocity", "Distance")

cars_new <- gdata::rename.vars(cars, fromnames, tonames)

You can keep those two vectors of names in a CSV file or whatever.

Remko Duursma
  • 2,741
  • 17
  • 24
  • giving me 'some of the from names not found in df' error message. and when i restrict it to int(fromnames, names(df)) i get 'from and to not same length' error message. since I have multiple data sources, there are necessarily fromnames not in tonames – sam Aug 31 '16 at 19:25
  • if you have the `fromnames` and `tonames` in a dataframe, you could do `subset(namesdf, fromnames %in% names(df))`, and apply those in `rename.vars` – Remko Duursma Sep 02 '16 at 01:22
0

A simple way to do this is to maintain a name lookup table that defines the mapping from names in your data sets to a common set of reference names. This table will have two columns:

  1. A name column for the names in your data sets.
  2. A reference column for the corresponding reference names to which you want to map.

For example, this name lookup table can be maintained as a CSV file named names.table.csv:

   name, reference
   Year, YEAR
  Month, MONTH
Revenue, Total Revenue
   YEAR, YEAR
  MONTH, MONTH
TOT.REV, Total Revenue

This CSV file can then be read into R:

names.table <- read.csv("/path/to/names.table.csv", header=TRUE, strip.white=TRUE, stringsAsFactors=FALSE)
print(names.table)
##     name     reference
##1    Year          YEAR
##2   Month         MONTH
##3 Revenue Total Revenue
##4    YEAR          YEAR
##5   MONTH         MONTH
##6 TOT.REV Total Revenue

It is important that we specify strip.white=TRUE and stringsAsFactors=FALSE to remove all leading and trailing white spaces and to maintain the entries as character strings.

Then, assuming that we have data sets (as data frames) that have column names that are in the name column of the names.table, for example:

df1 <- data.frame(Year = c(2014, 2014), Month=c("Jan", "Feb"), Revenue=c(124.5, 123.5), stringsAsFactors=FALSE)
print(df1)
##  Year Month Revenue
##1 2014   Jan   124.5
##2 2014   Feb   123.5
df2 <- data.frame(YEAR = c(2015, 2015), MONTH=c("Jan", "Feb"), TOT.REV=c(154.5, 132.5), stringsAsFactors=FALSE)
print(df2)
##  YEAR MONTH TOT.REV
##1 2015   Jan   154.5
##2 2015   Feb   132.5
df3 <- data.frame(YEAR = c(2016, 2016), Month=c("Jan", "Feb"), TOT.REV=c(154.5, 132.5), stringsAsFactors=FALSE)
print(df3)
##  YEAR Month TOT.REV
##1 2016   Jan   154.5
##2 2016   Feb   132.5

mapping those names can be performed by:

colnames(df1) <- names.table$reference[match(colnames(df1), names.table$name)]
print(df1)
##  YEAR MONTH Total Revenue
##1 2014   Jan         124.5
##2 2014   Feb         123.5
colnames(df2) <- names.table$reference[match(colnames(df2), names.table$name)]
print(df2)
##  YEAR MONTH Total Revenue
##1 2015   Jan         154.5
##2 2015   Feb         132.5
colnames(df3) <- names.table$reference[match(colnames(df3), names.table$name)]
print(df3)
##  YEAR MONTH Total Revenue
##1 2016   Jan         154.5
##2 2016   Feb         132.5

Hope this helps.

aichao
  • 7,375
  • 3
  • 16
  • 18