4

I want to use the Census' county-adjacency data, but am was stuck on getting it into a nice form. The data comes in four columns: first county, first code, second county, second code. The first county column does not repeat, and instead takes value "" the way I have it read in now:

                     c1   cd1                    c2   cd2
1   Alamance County, NC 37001   Alamance County, NC 37001
2                          NA    Caswell County, NC 37033
3                          NA    Chatham County, NC 37037
4                          NA   Guilford County, NC 37081
5                          NA     Orange County, NC 37135
6                          NA   Randolph County, NC 37151
7                          NA Rockingham County, NC 37157
8  Alexander County, NC 37003  Alexander County, NC 37003
9                          NA   Caldwell County, NC 37027
10                         NA    Catawba County, NC 37035
11                         NA    Iredell County, NC 37097
12                         NA     Wilkes County, NC 37193
13 Alleghany County, NC 37005  Alleghany County, NC 37005
14                         NA       Ashe County, NC 37009
15                         NA      Surry County, NC 37171
16                         NA     Wilkes County, NC 37193
17                         NA    Grayson County, VA 51077
18     Anson County, NC 37007      Anson County, NC 37007
19                         NA Montgomery County, NC 37123
20                         NA   Richmond County, NC 37153

I happen to only be interested in the North Carolina part of the data found at that link, part of which is what you see above:

#
nc_cc <- structure(list(c1 = c("Alamance County, NC", "", "", "", "", "", "", "Alexander County, NC", "", "", "", "", "Alleghany County, NC", "", "", "", "", "Anson County, NC", "", ""), cd1 = c(37001L, NA, NA, NA, NA, NA, NA, 37003L, NA, NA, NA, NA, 37005L, NA, NA, NA, NA, 37007L, NA, NA), c2 = c("Alamance County, NC", "Caswell County, NC", "Chatham County, NC", "Guilford County, NC", "Orange County, NC", "Randolph County, NC", "Rockingham County, NC", "Alexander County, NC", "Caldwell County, NC", "Catawba County, NC", "Iredell County, NC", "Wilkes County, NC", "Alleghany County, NC", "Ashe County, NC", "Surry County, NC", "Wilkes County, NC", "Grayson County, VA", "Anson County, NC", "Montgomery County, NC", "Richmond County, NC" ), cd2 = c(37001L, 37033L, 37037L, 37081L, 37135L, 37151L, 37157L, 37003L, 37027L, 37035L, 37097L, 37193L, 37005L, 37009L, 37171L, 37193L, 51077L, 37007L, 37123L, 37153L)), .Names = c("c1", "cd1", "c2", "cd2"), row.names = c(NA, 20L), class = "data.frame")
#

I want a clean adjacency association out of this (and the county names are superfluous), so my desired output could take a variety of forms: a data.frame, a list, ...

The crude solution I've come up with (after a lot of thought) is this:

require(data.table)
DT <- data.table(nc_cc)
DT[,list(cd1=cd1[1],cd2),by=cumsum(!is.na(cd1))][,list(cd1,cd2)]

giving

      cd1   cd2
 1: 37001 37001
 2: 37001 37033
 3: 37001 37037
 4: 37001 37081
 5: 37001 37135
 6: 37001 37151
 7: 37001 37157
 8: 37003 37003
 9: 37003 37027
10: 37003 37035
11: 37003 37097
12: 37003 37193
13: 37005 37005
14: 37005 37009
15: 37005 37171
16: 37005 37193
17: 37005 51077
18: 37007 37007
19: 37007 37123
20: 37007 37153

I've tagged this with data.table because I used it in my solution above, and I suspect there's something nice that can be done with roll. Really, I've never understood the documentation for roll, so I'm hoping to learn something here... So: can this be done a better way?

EDIT: This question is asking the same thing, so I revise my question to: "Is there a better way to do this using data.table or base R (since I am averse to installing more packages)?"

Community
  • 1
  • 1
Frank
  • 66,179
  • 8
  • 96
  • 180
  • Cool! I got a downvote; haven't had one of those since my first (quickly retracted) question. – Frank Sep 28 '13 at 16:10
  • If anyone knows of an elegant non-`cumsum` solution, it would be appreciated. Some day, I might come back and place a bounty. For now, I'm marking @eddi's as the answer, even though it is clear that he didn't read the question (which -- you know -- contains that answer). – Frank Sep 29 '13 at 03:29
  • 2
    Although the answers to the question you linked to require the zoo package, it offers some good functionality, like `maxgap` and `fromLast`, which may come in handy. And there's no reason you can't use `na.locf` with `data.table`. i.e. `DT[, cd1:=na.locf(cd1)]` – GSee Sep 29 '13 at 13:22
  • @GSee: Thanks for the pointers; maybe I should just install that. I see that it has no dependencies and is kept up-to-date. I just want to avoid (i) learning to use and (ii) using stuff that might not work a few years from now. – Frank Sep 29 '13 at 20:24
  • @Downvoters: If there's a problem with this question, you could just tell me. I can edit, close or delete it if you can convince me of what's wrong with it... – Frank Sep 29 '13 at 20:26
  • @Frank. I agree they should have said why the downvote. Looking at it myself: i) the question appears to depend on a link (that link might break in future). You sometimes see answers that contain a link downvoted for the same reason. ii) North Carolina isn't present in the extract of data displayed. If it was it'd be easier to understand the question. iii) you've asked for a better way but haven't defined "better" so it's hard to answer. Are you looking for faster? For more elegant? – Matt Dowle Sep 30 '13 at 10:21
  • @MatthewDowle Thanks, that makes sense. I was looking for something more elegant, since `cumsum(!...)` and `cumsum(diff(...))` have always felt like hacks and been hard for me to read. Also, I was sort of hoping for an illustration of what `roll` is used for. I think I'll just take the downvotes for this q and use what I've learned on the next one. (To any downvoters, in my defense: it actually doesn't depend on the link, since example data is also given; 'NC' is there, if not 'North Carolina'; yeah, just saying "better" is pretty bad; also, including unneeded columns, c1 and c2...) – Frank Sep 30 '13 at 11:23

2 Answers2

11

A pretty standard way of doing this is:

library(data.table)
dt = data.table(nc_cc)

dt[, cd1 := cd1[1], by = cumsum(!is.na(cd1))]
eddi
  • 49,088
  • 6
  • 104
  • 155
0

I found a roll solution based on an answer of @Arun's!

In my application, it's much more convoluted than the cumsum answer that @eddi (...and I, in stating the question) used:

DT <- data.table(nc_cc)
setkey(DT[,i:=.I],i)

DT[
    DT[c1!=""][J(1:20),roll=TRUE][,list(c1,cd1),key=i],
    `:=`(c1=i.c1,cd1=i.cd1)
]

I learned the i.name thing from from @eddi's answer to my other question.

Community
  • 1
  • 1
Frank
  • 66,179
  • 8
  • 96
  • 180