I'm trying to wrap my head around closures, and I think I've found a case where they might be helpful.
I have the following pieces to work with:
- A set of regular expressions designed to clean state names, housed in a function
- A data.frame with state names (of the standardized form that the function above creates) and state ID codes, to link the two (the "merge map")
The idea is, given some data.frame with sloppy state names (is the capital listed as "Washington, D.C.", "washington DC", "District of Columbia", etc.?), to have a single function return the same data.frame with the state name column removed and only the state ID codes remaining. Then subsequent merges can happen consistently.
I can do this in any number of ways, but one that seems particularly elegant would be to house the merge map, the regular expressions, and the processing code inside a closure (following the idea that a closure is a function with data).
Question 1: Is this a reasonable idea?
Question 2: If so, how do I do it in R?
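For context, here's the generic closure pattern in R — a minimal sketch of "a function with data", unrelated to the state-name problem itself:

```r
# make_counter returns a function that carries its own private state:
# `count` lives in the enclosing environment, invisible to the caller.
make_counter <- function() {
  count <- 0
  function() {
    count <<- count + 1  # `<<-` assigns in the enclosing environment
    count
  }
}

counter <- make_counter()
counter()  # 1
counter()  # 2
```

Each call to `make_counter()` creates a fresh environment, so separate counters don't interfere with each other.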
Here's a stupid simple clean state names function that works on the example data:
cleanStateNames <- function(x) {
  x <- tolower(x)                   # standardize case
  x[grepl("columbia", x)] <- "DC"   # map "District of Columbia" variants to "DC"
  x
}
Here's some example data that the eventual function will be run on:
dat <- structure(list(state = c("Alabama", "Alaska", "Arizona", "Arkansas",
"California", "Colorado", "Connecticut", "Delaware", "District of Columbia",
"Florida"), pop08 = structure(c(29L, 44L, 40L, 18L, 25L, 30L,
22L, 48L, 36L, 13L), .Label = c("1,050,788", "1,288,198", "1,315,809",
"1,316,456", "1,523,816", "1,783,432", "1,814,468", "1,984,356",
"10,003,422", "11,485,910", "12,448,279", "12,901,563", "18,328,340",
"19,490,297", "2,600,167", "2,736,424", "2,802,134", "2,855,390",
"2,938,618", "24,326,974", "3,002,555", "3,501,252", "3,642,361",
"3,790,060", "36,756,666", "4,269,245", "4,410,796", "4,479,800",
"4,661,900", "4,939,456", "5,220,393", "5,627,967", "5,633,597",
"5,911,605", "532,668", "591,833", "6,214,888", "6,376,792",
"6,497,967", "6,500,180", "6,549,224", "621,270", "641,481",
"686,293", "7,769,089", "8,682,661", "804,194", "873,092", "9,222,414",
"9,685,744", "967,440"), class = "factor")), .Names = c("state",
"pop08"), row.names = c(NA, 10L), class = "data.frame")
And a sample merge map (the actual one links FIPS codes to states, so it can't be trivially generated):
merge_map <- data.frame(state=dat$state, id=seq(10) )
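One caveat with this sample map: `dat$state` holds the raw names, while `cleanStateNames` lowercases everything, so a merge on cleaned data would miss every row. If the map's names aren't already in cleaned form, running them through the same function keeps the two sides consistent:

```r
# Apply the same cleaning to the map's key column so it matches
# the cleaned state names in the data being merged.
merge_map$state <- cleanStateNames(merge_map$state)
```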
EDIT: Building on crippledlambda's answer below, here's an attempt at the function:
prepForMerge <- local({
  merge_map <- structure(
    list(state = c("alabama", "alaska", "arizona", "arkansas", "california",
                   "colorado", "connecticut", "delaware", "DC", "florida"),
         id = 1:10),
    .Names = c("state", "id"),
    row.names = c(NA, -10L),
    class = "data.frame")
  list(
    replace_merge_map = function(new_merge_map) {
      merge_map <<- new_merge_map
    },
    show_merge_map = function() {
      merge_map
    },
    return_prepped_data.frame = function(dat) {
      dat$state <- cleanStateNames(dat$state)
      dat <- merge(dat, merge_map)
      dat <- subset(dat, select = -state)
      dat
    }
  )
})
> prepForMerge$return_prepped_data.frame(dat)
pop08 id
1 4,661,900 1
2 686,293 2
3 6,500,180 3
4 2,855,390 4
5 36,756,666 5
6 4,939,456 6
7 3,501,252 7
8 591,833 9
9 873,092 8
10 18,328,340 10
Two problems remain before I'd consider this question solved:

1. Calling

   prepForMerge$return_prepped_data.frame(dat)

   is painful each time. Is there any way to set a default function, so that I could just call prepForMerge(dat)? I'm guessing not, given how it's implemented, but perhaps there's at least a naming convention for the default function.

2. How do I avoid mixing the data and the code in the merge_map definition? Ideally I'd clean merge_map elsewhere, then just grab it inside the closure and store it there.