
Here is my dataset: https://wiki.csc.calpoly.edu/datasets/attachment/wiki/HighwayAccidents/ACCIDENT2007-FullDataSet.csv

I cleaned up a dataset of all car accidents in the US and its territories (state codes 1:56) in 2007, and have a large CSV file with nine variables, including State, Vehicles, Pedestrians, Persons, Drunk Drivers, Fatalities, Date, and Time. The CSV lists each accident as a separate row, and the states are identified numerically. I would like to sum various columns per state, without summing the State column itself.

I would like to have a result like:

State        Drunk_Dr
1               345
2              1023

or

State       Fatalities   Drunk_Dr
34              123        134
35               56         64

and so on, for state codes 1:56.

– jtarrou
    Hopefully you don't expect people to willingly download 6.2MB of data just to answer one question. – Rich Scriven Sep 22 '14 at 05:02
    Check out [how to make a great R reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) for tips on creating a *minimal* reproducible example. Focus only on what's necessary to illustrate your particular problem. – MrFlick Sep 22 '14 at 05:29
  • Here is my dataset: https://wiki.csc.calpoly.edu/datasets/attachment/wiki/HighwayAccidents/ACCIDENT2007-FullDataSet.csv – jtarrou Sep 22 '14 at 05:47
  • @jtarrou: see this post: http://stackoverflow.com/questions/25975920/error-cannot-allocate-vector-of-size-1-7-gb-in-r – rnso Sep 22 '14 at 15:19

1 Answer

library(data.table)

file.in <- "path/to/your/file.csv"
DT.accidents <- fread(file.in)

## Have a look at the different DRUNK_DR values
DT.accidents[, table(DRUNK_DR)]
## Nine?? Really?  

DT.accidents[DRUNK_DR == 9]


## Anyway, to count accidents by state and number of drunk drivers
## (assuming one row of data is one accident), you can simply use:

DT.accidents[, .N, by=list(STATE, DRUNK_DR)]
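
If you literally want per-state sums of numeric columns (rather than row counts), a minimal sketch using `lapply(.SD, sum)` — the column names `FATALS` and `DRUNK_DR` here are assumptions based on the FARS file layout, so check `names(DT.accidents)` first:

## Sum selected columns per state; .SDcols restricts .SD to those columns
## (FATALS and DRUNK_DR are assumed column names)
DT.sums <- DT.accidents[, lapply(.SD, sum), by=STATE,
                        .SDcols=c("FATALS", "DRUNK_DR")]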


## If you want to ignore cases with zero drunk drivers, filter those out
DT.drunks <- DT.accidents[DRUNK_DR > 0, .N, by=list(STATE, DRUNK_DR)]

## You can reshape it too, if you'd like

library(reshape2)
DT.drunks <- as.data.table(dcast(DT.drunks, STATE ~ DRUNK_DR, value.var="N"))
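
Newer versions of data.table (1.9.6 and later) ship their own `dcast` method, which skips the reshape2 round-trip and returns a data.table directly — a sketch, assuming such a version is installed:

## data.table's dcast returns a data.table, so no as.data.table() wrapper;
## fill=0 puts zeros where a STATE/DRUNK_DR combination never occurs
DT.drunks <- dcast(DT.drunks, STATE ~ DRUNK_DR, value.var="N", fill=0)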

Adding in state names

State names, according to the 2007 FARS user guide:
ftp://ftp.nhtsa.dot.gov/FARS/FARS-DOC/USERGUIDE-2007.pdf


## Start with the built-in variable 'state.name' (no "s"), which holds all 50 state names
state_names <- state.name
## Add in territories
state_names <- sort(c(state_names, "District of Columbia", "Puerto Rico", "Virgin Islands"))
## Create index numbers that match what is shown in the file
state_number <- setdiff(1:56, c(3, 7, 14))
## Create a data.table for joining
DT.states <- data.table(state_number=state_number, state_names=state_names)
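
One caveat: `sort()` is locale-dependent, and in the C locale "Virgin Islands" sorts before "Virginia", which would silently swap codes 51 and 52. A quick sanity check against the user guide:

## Puerto Rico should map to 43, Virginia to 51, Virgin Islands to 52
DT.states[state_number %in% c(43, 51, 52)]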

## Join in the info
setkey(DT.states, "state_number")
setkey(DT.accidents, "STATE")
DT.accidents[DT.states, STATE_NAMES := state_names]
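
It's also worth confirming the join matched every code in the data — a quick check; a non-zero count would mean some STATE code is missing from DT.states:

## Count accidents whose STATE code found no name
DT.accidents[is.na(STATE_NAMES), .N]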

## Now you can reshape, but include the names
DT.drunks <- DT.accidents[DRUNK_DR > 0, .N, by=list(STATE, STATE_NAMES, DRUNK_DR)]

## You can reshape it too, if you'd like
DT.drunks <- as.data.table(dcast(DT.drunks, STATE + STATE_NAMES ~ DRUNK_DR, value.var="N"))

Now... as for that nine-drunk-driver accident.

DT.accidents[DRUNK_DR == 9]

Googling "Montana May 19 2007 Traffic Fatality", the first result leads to http://www.city-data.com/accidents/acc-Browning-Montana.html, which has this piece of information:

Browning fatal car crashes and road traffic accidents (wrecks) list for 2007: May 19, 2007 05:55 PM, Us-2, Sr-464, Lat: 48.555692, Lon: -113.010247, Vehicles: 1, Fatalities: 1, Drunken drivers: Inconsistent data

– Ricardo Saporta
  • This is great. Now I need to build a function to select a certain state and sum various variables for that state. – jtarrou Sep 22 '14 at 06:11
  • @jtarrou, you probably don't need to "build" any functions. Most of what you need you can do via the `, by=...` argument in data.table. Have a look at the vignette: http://cran.r-project.org/web/packages/data.table/vignettes/datatable-intro.pdf – Ricardo Saporta Sep 22 '14 at 06:14
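
For instance, selecting one state and summing a few columns needs no custom function, just `i` to filter and `.SD` to sum — a minimal sketch, where the code 6 (California under this file's numbering) and the column names `FATALS` and `DRUNK_DR` are assumptions:

## Sum assumed columns for a single state (code 6 = California in this file)
DT.accidents[STATE == 6, lapply(.SD, sum), .SDcols=c("FATALS", "DRUNK_DR")]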