1

I'm looking for help on how to take a long dataset and make it wide, but with some summarizing as well.

My dataset generically looks like this (I'm new to SO, so sorry if my formatting isn't up to par!):

structure(list(StateSenatorialDistrict = c(41L, 14L, 30L, 38L, 
43L, 37L, 20L, 45L, 37L, 44L), CandidateOfficeCode = structure(c(2L, 
5L, 2L, 5L, 4L, 3L, 1L, 4L, 1L, 1L), .Label = c("ATT", "AUD", 
"STH", "TRE", "USP"), class = "factor"), CandidateLastName = structure(c(4L, 
2L, 1L, 3L, 9L, 5L, 7L, 8L, 7L, 6L), .Label = c("BROWN", "CASTLE", 
"CLINTON", "DEPASQUALE", "MILLER", "RAFFERTY", "SHAPIRO", "TORSELLA ", 
"VOIT"), class = "factor"), CandidateParty = structure(c(2L, 
1L, 3L, 2L, 3L, 2L, 2L, 2L, 2L, 3L), .Label = c("CON", "DEM", 
"REP"), class = "factor"), VoteTotal = c(256L, 3L, 202L, 188L, 
18L, 402L, 251L, 383L, 156L, 761L)), .Names = c("StateSenatorialDistrict", 
"CandidateOfficeCode", "CandidateLastName", "CandidateParty", 
"VoteTotal"), row.names = c(30901L, 115192L, 41264L, 1389L, 21982L, 
29827L, 192288L, 20019L, 12803L, 60823L), class = "data.frame")

It is a dataset of precinct level vote data in Pennsylvania.

StateSenatorialDistrict CandidateOfficeCode CandidateLastName CandidateParty VoteTotal
41                 AUD        DEPASQUALE            DEM       256
14                 USP            CASTLE            CON         3
30                 AUD             BROWN            REP       202
38                 USP           CLINTON            DEM       188
43                 TRE              VOIT            REP        18
37                 STH            MILLER            DEM       402
20                 ATT           SHAPIRO            DEM       251
45                 TRE         TORSELLA             DEM       383
37                 ATT           SHAPIRO            DEM       156    
44                 ATT          RAFFERTY            REP       761

There are many more columns, but for the purposes here, these are fine.

I'd like to take this data and summarize it such that I get one line per Senatorial District with select other data on each line. The idea outcome would look like this (the data here is made up--it isn't based on the above):

StateSenatorialDistrict SenateRepLastName SenateDemLastName  SenateRepVoteTotal SenateDemVoteTotal ClintonVotes TrumpVotes
41                 BOZO            SMITH            250            300            1000            2000
42                 JOHNSON         CARTER           2012           237            1350            1000
53                 ARCHIBALD       BISHOP           350            500            5000            3000

On any given row, you know that a candidate is a senator b/c their CandidateOfficeCode is STS; you know that they are a Dem or Rep by their Party which is REP or DEM.

I know that I can aggregate data and then try to dcast it to a wide format, but that leaves me with a very wide table, with each candidate name being a column (and no information about office or party):

senateDistricts2016 <- aggregate(VoteTotal ~ StateSenatorialDistrict + CandidateOfficeCode + CandidateFirstName + CandidateLastName + CandidateParty, data=votes2016[votes2016$CandidateOfficeCode %in% c("USP", "STS"),], FUN="sum")
wideSenate <- dcast(senateDistricts2016, StateSenatorialDistrict ~ CandidateLastName)

Is there an easy way to do this? If not, thoughts about a hard way to do this?

Thanks in advance. And please let me know if my question doesn't make sense--I'm happy to edit.

EDIT:

I don't believe this is a dupe. I'm not just trying to make my data wide, but summarize it when it becomes wide, so it is not unweildy. I ended up doing a series of aggregates and then merges:

senateDs <- aggregate(VoteTotal ~ StateSenatorialDistrict + CandidateName, data=votes2016[votes2016$CandidateOfficeCode=="STS" & votes2016$CandidateParty=="DEM",], FUN="sum")
senateRs <- aggregate(VoteTotal ~ StateSenatorialDistrict + CandidateName, data=votes2016[votes2016$CandidateOfficeCode=="STS" & votes2016$CandidateParty=="REP",], FUN="sum")
senateTotalVotes <- aggregate(VoteTotal ~ StateSenatorialDistrict, data=votes2016[votes2016$CandidateOfficeCode=="STS",], FUN="sum")
senateVotes <- merge(senateDs,senateRs, by="StateSenatorialDistrict", all=TRUE)
senateVotes <- merge(senateVotes, senateTotalVotes, by="StateSenatorialDistrict", all=TRUE)

# now aggregate the presidential votes for Rs, Ds, and Total votes and combine
senatePresD <- aggregate(VoteTotal ~ StateSenatorialDistrict, data=votes2016[votes2016$CandidateOfficeCode=="USP" & votes2016$CandidateParty=="DEM",], FUN="sum")
senatePresR <- aggregate(VoteTotal ~ StateSenatorialDistrict, data=votes2016[votes2016$CandidateOfficeCode=="USP" & votes2016$CandidateParty=="REP",], FUN="sum")
senatePresTotalVotes <- aggregate(VoteTotal ~ StateSenatorialDistrict, data=votes2016[votes2016$CandidateOfficeCode=="USP",], FUN="sum")
senateVotes <- merge(senateVotes,senatePresD, by="StateSenatorialDistrict", all=TRUE)
senateVotes <- merge(senateVotes,senatePresR, by="StateSenatorialDistrict", all=TRUE)
senateVotes <- merge(senateVotes, senatePresTotalVotes, by="StateSenatorialDistrict", all=TRUE)


setnames(senateVotes, c("StateSenatorialDistrict", "DCandidate","DVotes","RCandidate","RVotes", "TotalSenatorVotes", "PresDVotes", "PresRVotes","TotalPresVotes"))
Thaqor
  • 49
  • 7
  • I don't actually agree that this is a duplicate. I noted dcast as a potential solution, but it doesn't get me where I want since each of the candidate variations now show up as a column. I was hoping to summarize all of the candidates of each party into generic "D Candidate" and "R Candidate" columns. This is helpful to me since each senatorial district is going to only have one D and one R candidate. – Thaqor Feb 21 '17 at 01:22
  • I was able to solve my problem through a series of aggregate and then merge commands. – Thaqor Feb 21 '17 at 01:58

0 Answers0