0

I want to get county names associated with all zip codes in a dataset. I'm getting relatively fast results with a data frame (although I feel it can be done much faster) but not so with a data.table, even with some optimization. Is there a way to further speed it up using data frames or data.tables?

Here is my initialization (based on this answer):

library(noncensus)
data(zip_codes)
data(counties)
counties$fips <- as.numeric(paste0(counties$state_fips, counties$county_fips))

Computation with data frames (second is slightly faster, as expected) - 20, 16 seconds:

system.time(sapply(zip_codes$fips, function(x) subset(counties, fips == x)$county_name))
system.time(sapply(zip_codes$fips, function(x) counties[counties$fips==x,]$county_name))

Computation with data tables - 60, 43 seconds:

zip_codes.dt <- data.table(zip_codes)
counties.dt <- data.table(zip_codes)
system.time(sapply(zip_codes.dt$fips, function(x) subset(counties.dt, fips == x)$county_name))
setkey(counties.dt, fips)  # optimizing
system.time(sapply(zip_codes.dt$fips, function(x) counties.dt[.(x)]$county_name))
Community
  • 1
  • 1
Naumz
  • 481
  • 5
  • 15
  • Your first threes lines, the construction of `counties$fips`, could be obtained in a single line with `counties$fips <- interaction(counties$state_fips, counties$county_fips)`. – lmo Jan 23 '17 at 21:04
  • 1
    @lmo That introduces a `.` in the value of the factor variable, which doesn't match with the naming in `zip_codes` for `fips`. – Naumz Jan 23 '17 at 21:09
  • 2
    `interaction` has a sep argument that is set to "." as default. use sep="" to get rid of it. – lmo Jan 23 '17 at 21:11
  • 1
    We can just use `paste` i.e. `paste0(counties$state_fips, counties$county_fips)` and it will be a `character` vector – akrun Jan 23 '17 at 21:16
  • 1
    Most of what you're doing here is very inconsistent with standard data.table syntax. Maybe run through the vignettes first. For starters: `counties.dt[.(x)]$county_name` should use `, county_name` instead of `$`; and I doubt that there's any reason to use `sapply` instead of doing a single join there... – Frank Jan 23 '17 at 21:57
  • @Frank thank you! They were very useful reads. I got my answer. – Naumz Jan 24 '17 at 00:01

1 Answers1

2

Based on @Frank's suggestion, reading the vignettes here and the package documentation helped me find the answer using data.table.

Here it is:

zip_codes.dt[counties.dt, on="fips", county_name := county_name]
Naumz
  • 481
  • 5
  • 15