I want to use the Census' county-adjacency data, but am was stuck on getting it into a nice form. The data comes in four columns: first county, first code, second county, second code. The first county column does not repeat, and instead takes value "" the way I have it read in now:
c1 cd1 c2 cd2
1 Alamance County, NC 37001 Alamance County, NC 37001
2 NA Caswell County, NC 37033
3 NA Chatham County, NC 37037
4 NA Guilford County, NC 37081
5 NA Orange County, NC 37135
6 NA Randolph County, NC 37151
7 NA Rockingham County, NC 37157
8 Alexander County, NC 37003 Alexander County, NC 37003
9 NA Caldwell County, NC 37027
10 NA Catawba County, NC 37035
11 NA Iredell County, NC 37097
12 NA Wilkes County, NC 37193
13 Alleghany County, NC 37005 Alleghany County, NC 37005
14 NA Ashe County, NC 37009
15 NA Surry County, NC 37171
16 NA Wilkes County, NC 37193
17 NA Grayson County, VA 51077
18 Anson County, NC 37007 Anson County, NC 37007
19 NA Montgomery County, NC 37123
20 NA Richmond County, NC 37153
I happen to only be interested in the North Carolina part of the data found at that link, part of which is what you see above:
#
nc_cc <- structure(list(c1 = c("Alamance County, NC", "", "", "", "", "", "", "Alexander County, NC", "", "", "", "", "Alleghany County, NC", "", "", "", "", "Anson County, NC", "", ""), cd1 = c(37001L, NA, NA, NA, NA, NA, NA, 37003L, NA, NA, NA, NA, 37005L, NA, NA, NA, NA, 37007L, NA, NA), c2 = c("Alamance County, NC", "Caswell County, NC", "Chatham County, NC", "Guilford County, NC", "Orange County, NC", "Randolph County, NC", "Rockingham County, NC", "Alexander County, NC", "Caldwell County, NC", "Catawba County, NC", "Iredell County, NC", "Wilkes County, NC", "Alleghany County, NC", "Ashe County, NC", "Surry County, NC", "Wilkes County, NC", "Grayson County, VA", "Anson County, NC", "Montgomery County, NC", "Richmond County, NC" ), cd2 = c(37001L, 37033L, 37037L, 37081L, 37135L, 37151L, 37157L, 37003L, 37027L, 37035L, 37097L, 37193L, 37005L, 37009L, 37171L, 37193L, 51077L, 37007L, 37123L, 37153L)), .Names = c("c1", "cd1", "c2", "cd2"), row.names = c(NA, 20L), class = "data.frame")
#
I want a clean adjacency association out of this (and the county names are superfluous), so my desired output could take a variety of forms: a data.frame, a list, ...
The crude solution I've come up with (after a lot of thought) is this:
require(data.table)
DT <- data.table(nc_cc)
DT[,list(cd1=cd1[1],cd2),by=cumsum(!is.na(cd1))][,list(cd1,cd2)]
giving
cd1 cd2
1: 37001 37001
2: 37001 37033
3: 37001 37037
4: 37001 37081
5: 37001 37135
6: 37001 37151
7: 37001 37157
8: 37003 37003
9: 37003 37027
10: 37003 37035
11: 37003 37097
12: 37003 37193
13: 37005 37005
14: 37005 37009
15: 37005 37171
16: 37005 37193
17: 37005 51077
18: 37007 37007
19: 37007 37123
20: 37007 37153
I've tagged this with data.table
because I used it in my solution above, and I suspect there's something nice that can be done with roll
. Really, I've never understood the documentation for roll
, so I'm hoping to learn something here... So: can this be done a better way?
EDIT: This question is asking the same thing, so I revise my question to: "Is there a better way to do this using data.table
or base R (since I am averse to installing more packages)?"