
I am using R and have a data set of traffic crash information with the columns listed below. In all, there are 25 variables and 69,523 observations. Note that the variable CrashFatalities was generated by the following line of code:

df$CrashFatalities <- ifelse(df$FatalInjuries > 0, TRUE, FALSE)

Ultimately, my goal is to create a logistic regression model using CrashFatalities as the target variable to predict the likelihood that a car accident will result in at least one fatality.

Because the data set mixes numeric and categorical variables, my plan was to apply FAMD (factor analysis of mixed data) to reduce the number of variables used in the logistic regression model.
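For orientation, here is a minimal sketch of what such an FAMD step could look like. It assumes the FactoMineR package (the question does not name a package) and a small made-up data frame `toy` rather than the real crash data. The factor columns are passed as they are, and the target variable is left out of the input.

```r
# Toy mixed data: column names mimic the question, values are made up.
set.seed(1)
n <- 200
toy <- data.frame(
  Age           = sample(16:90, n, replace = TRUE),
  MinorInjuries = rpois(n, 0.5),
  Weather       = factor(sample(c("Clear", "Cloudy", "Rain"), n, replace = TRUE)),
  Lighting      = factor(sample(c("Daylight", "Dusk - Dawn", "Dark - Street Light"),
                                n, replace = TRUE))
)

scores <- NULL
if (requireNamespace("FactoMineR", quietly = TRUE)) {
  # Assumption: FactoMineR is the FAMD implementation in use.
  # Factors go in as factors; there is no manual one-hot step beforehand.
  res <- FactoMineR::FAMD(toy, ncp = 3, graph = FALSE)
  # One row per observation, one column per retained dimension.
  scores <- as.data.frame(res$ind$coord)
  print(head(scores))
}
```

If this runs, `scores` holds the per-observation coordinates on the retained dimensions, which could then serve as predictors in place of the original columns. The choice of `ncp = 3` is arbitrary here.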

My question is: should one-hot encoding be applied to the categorical (factor) variables, and if so, at which point in the analysis: before applying FAMD, after it, or not at all? I am worried that one-hot encoding the categorical variables beforehand will cause a memory issue because of the many extra variables it adds to the data set.
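To put a rough number on the memory concern, the sketch below adds up the factor level counts shown in the `str()` listing in this question and estimates the size of a fully one-hot-encoded dense numeric matrix. It is only a back-of-the-envelope estimate: it ignores the integer columns and any copies R makes while modelling. The small `toy` data frame at the end is made up for illustration.

```r
# Number of levels per factor, copied from the str() listing in the question.
n_levels <- c(
  PedestrianAction = 11, RoadwaySurface = 5, RoadwayCondition = 9,
  Lighting = 6, PrimaryCollisionFactor = 9, TrafficControl = 5,
  Weather = 8, CollisionType = 9, ProximityToIntersection = 4,
  VehicleInvolvedWith = 18, Sex = 3, VehicleDamage = 7, Sobriety = 9,
  MovementPrecedingCollision = 22, PartyType = 20,
  ViolationCodeDescription = 80, Month = 12, Day = 31, Year = 10
)
n_rows <- 69523

onehot_cols <- sum(n_levels)             # one indicator column per level
dense_bytes <- n_rows * onehot_cols * 8  # 8 bytes per double
cat(onehot_cols, "columns,", round(dense_bytes / 1e6),
    "MB as a dense numeric matrix\n")

# The same expansion on a tiny made-up factor, for illustration.
toy <- data.frame(Weather = factor(c("Clear", "Rain", "Clear", "Fog")))
onehot <- model.matrix(~ Weather - 1, data = toy)
print(onehot)
```

Under these assumptions the 19 factors expand to 278 indicator columns, or roughly 155 MB for 69,523 rows stored as doubles. Whether that is a problem depends on the machine.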

I understand that there may still be some data cleaning to do, such as creating a dummy variable for CrashFatalities and likely removing FatalInjuries from the list of features in the logistic model, since the target is derived from it. I have not found much documentation on FAMD, this will be my first attempt at applying it to any data set, and I am really stuck on how to proceed from here. I am fairly new to data analytics/science and am using this project as practice and a learning experience, so I apologise if I have left out any pertinent information. I am more than happy to elaborate on anything if necessary.
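The cleaning steps mentioned above could be sketched as follows. The rows are made up and only the column names come from the question; `CrashFatalities01` and `features` are hypothetical names introduced for the illustration.

```r
# Made-up rows; only the column names come from the question.
toy <- data.frame(
  FatalInjuries = c(0L, 1L, 0L, 2L),
  Age           = c(24L, 25L, 66L, 50L)
)

# The comparison already returns TRUE/FALSE, so ifelse() is optional.
toy$CrashFatalities <- toy$FatalInjuries > 0

# 0/1 version of the target, if a numeric dummy is preferred.
toy$CrashFatalities01 <- as.integer(toy$CrashFatalities)

# Drop the count the target was derived from so it cannot leak into the model.
features <- toy[, setdiff(names(toy), "FatalInjuries"), drop = FALSE]
print(features)
```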

$ MinorInjuries             : int  0 0 0 2 1 0 0 1 1 0 ...
$ ModerateInjuries          : int  0 0 1 0 0 0 0 0 0 0 ...
$ SevereInjuries            : int  0 0 0 0 0 0 0 0 0 0 ...
$ FatalInjuries             : int  0 0 0 0 0 0 0 0 0 0 ...
$ PedestrianAction          : Factor w/ 11 levels "Approaching/Leaving School Bus",..: 6 6 6 2 6
$ RoadwaySurface            : Factor w/ 5 levels "Dry","Slippery (Muddy Oily etc.)",..: 5 5 1 1 1
$ RoadwayCondition          : Factor w/ 9 levels "Construction - Repair Zone",..: 5 5 5 5 5 5 5 5
$ Lighting                  : Factor w/ 6 levels "Dark - No Street Light",..: 4 4 2 2 4 2 2 2 2 2
$ PrimaryCollisionFactor    : Factor w/ 9 levels "Bike At Fault",..: 8 8 8 8 8 7 7 8 8 8 ...
$ TrafficControl            : Factor w/ 5 levels "Controls Functioning",..: 4 4 4 4 4 1 1 1 1 4
$ Weather                   : Factor w/ 8 levels "Clear","Cloudy",..: 5 5 1 2 1 1 5 1 1 1 ...
$ CollisionType             : Factor w/ 9 levels "Broadside","Head On",..: 3 5 2 9 7 2 6 6 6 4
$ ProximityToIntersection   : Factor w/ 4 levels "Driveway","Intersection",..: 3 3 3 2 3 4 4 2 2
$ VehicleInvolvedWith       : Factor w/ 18 levels "Animal","Bike",..: 3 3 3 12 3 10 10 10 10 11
$ Sex                       : Factor w/ 3 levels "F","M","Unknown": 1 1 2 2 2 2 2 2 2 2 ...
$ Age                       : int  24 25 66 50 31 73 32 29 30 33 ...
$ VehicleDamage             : Factor w/ 7 levels "Major","Minor",..: 1 6 1 2 1 7 7 1 3 2 ...
$ Sobriety                  : Factor w/ 9 levels "Had Been Drinking - Impairment Unknown",..: 4 4
$ MovementPrecedingCollision: Factor w/ 22 levels "Backing","Changing Lanes",..: 2 17 18 5 16 20
$ PartyType                 : Factor w/ 20 levels "Bicycle","Bus - Other",..: 4 4 4 4 4 4 4 4 4 4
$ ViolationCodeDescription  : Factor w/ 80 levels "Bad Brakes","Bald Tires",..: 54 54 54 54 54 61
$ Month                     : Factor w/ 12 levels "1","2","3","4",..: 3 2 2 3 5 2 12 2 2 12 ...
$ Day                       : Factor w/ 31 levels "1","2","3","4",..: 1 2 10 3 25 26 19 23 23 20
$ Year                      : Factor w/ 10 levels "2011","2012",..: 8 6 6 6 7 3 4 3 3 3 ...
$ CrashFatalities           : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...

First 4 rows of data:

structure(list(MinorInjuries = c(0L, 0L, 0L, 2L), ModerateInjuries = c(0L, 
0L, 1L, 0L), SevereInjuries = c(0L, 0L, 0L, 0L), FatalInjuries = c(0L, 
0L, 0L, 0L), PedestrianAction = structure(c(6L, 6L, 6L, 2L), .Label = c("Approaching/Leaving School Bus", 
"Crossing - Not In Crosswalk", "Crossing In Crosswalk - At Intersection", 
"Crossing In Crosswalk - Not At Intersection", "In Road - Includes Shoulder", 
"No Pedestrians Involved", "Not In Road", "Other", "Running/Jogging", 
"Unknown", "Walking"), class = "factor"), RoadwaySurface = structure(c(5L, 
5L, 1L, 1L), .Label = c("Dry", "Slippery (Muddy Oily etc.)", 
"Snowy - Icy", "Unknown", "Wet"), class = "factor"), RoadwayCondition = structure(c(5L, 
5L, 5L, 5L), .Label = c("Construction - Repair Zone", "Flooded", 
"Holes Deep Rut", "Loose Material On Roadway", "No Unusual Conditions", 
"Obstruction On Roadway", "Other", "Reduced Roadway Width", "Unknown"
), class = "factor"), Lighting = structure(c(4L, 4L, 2L, 2L), .Label = c("Dark - No Street Light", 
"Dark - Street Light", "Dark - Street Light Not Functioning", 
"Daylight", "Dusk - Dawn", "Unknown"), class = "factor"), PrimaryCollisionFactor = structure(c(8L, 
8L, 8L, 8L), .Label = c("Bike At Fault", "Fell Asleep", "Other Improper Driving", 
"Other Than Driver", "Parked/Rolling", "Pedestrian At Fault", 
"Unknown", "Violation Driver 1", "Violation Driver 2"), class = "factor"), 
    TrafficControl = structure(c(4L, 4L, 4L, 4L), .Label = c("Controls Functioning", 
    "Controls Not Functioning", "Controls Obscured", "No Controls Present/Factor", 
    "Unknown"), class = "factor"), Weather = structure(c(5L, 
    5L, 1L, 2L), .Label = c("Clear", "Cloudy", "Fog", "Other", 
    "Rain", "Snow", "Unknown", "Wind"), class = "factor"), CollisionType = structure(c(3L, 
    5L, 2L, 9L), .Label = c("Broadside", "Head On", "Hit Object", 
    "Other", "Overturned", "Rear End", "Sideswipe", "Vehicle/Bike", 
    "Vehicle/Pedestrian"), class = "factor"), ProximityToIntersection = structure(c(3L, 
    3L, 3L, 2L), .Label = c("Driveway", "Intersection", "Non-Related", 
    "Related"), class = "factor"), VehicleInvolvedWith = structure(c(3L, 
    3L, 3L, 12L), .Label = c("Animal", "Bike", "Fixed Object", 
    "Ice Cream Truck", "Light Rail Vehicle", "Motor Vehicle On Other Roadway", 
    "Motorcycle", "Non-Collision", "Other Object", "Other Vehicle", 
    "Parked Vehicle", "Pedestrian", "Scooter Motorized", "Scooter Non-Motorized", 
    "Skateboard", "Train", "Unknown", "Wheelchair"), class = "factor"), 
    Sex = structure(c(1L, 1L, 2L, 2L), .Label = c("F", "M", "Unknown"
    ), class = "factor"), Age = c(24L, 25L, 66L, 50L), VehicleDamage = structure(c(1L, 
    6L, 1L, 2L), .Label = c("Major", "Minor", "Moderate", "None", 
    "Not Applicable", "Totaled", "Unknown"), class = "factor"), 
    Sobriety = structure(c(4L, 4L, 3L, 4L), .Label = c("Had Been Drinking - Impairment Unknown", 
    "Had Been Drinking - Not Under Influence", "Had Been Drinking - Under Influence", 
    "Had Not Been Drinking", "Impairment Not Known", "Impairment Physical", 
    "Not Applicable", "Sleepy/Fatigued", "Under Drug Influence"
    ), class = "factor"), MovementPrecedingCollision = structure(c(2L, 
    17L, 18L, 5L), .Label = c("Backing", "Changing Lanes", "Crossing Into Opposing Lane", 
    "Entering Traffic", "Making Left Turn", "Making Right Turn", 
    "Making U-Turn", "Merging", "Other", "Other (Bike)", "Other (Ped)", 
    "Other Unsafe Turning", "Parked", "Parking Maneuver", "Passing Other Vehicles",  "Proceeding Straight", "Ran Off Road", "Slowing/Stopping", 
    "Stalled", "Stopped", "Traveling Wrong Way", "Unknown"), class = "factor"), 
    PartyType = structure(c(4L, 4L, 4L, 4L), .Label = c("Bicycle", 
    "Bus - Other", "Bus - School", "Car", "Car With Trailer", 
    "Construction Equipment", "Emergency Vehicle", "Ice Cream Truck", 
    "Light Rail Vehicle", "Motorcycle/Moped", "Other", "Panel Truck", 
    "Pedestrian", "Scooter Motorized", "Scooter Non-Motorized", 
    "Semi Truck", "Skateboard", "Train", "Unknown", "Wheelchair"
    ), class = "factor"), ViolationCodeDescription = structure(c(54L, 
    54L, 54L, 54L), .Label = c("Bad Brakes", "Bald Tires", "Bike Over Centerline", 
    "Child Safety/Belt", "Condition/Brakes", "Crossing Controlled Intersection/Jaywalking", 
    "Curb Parking", "Divided Highway", "Driving Drunk", "Driving Drunk <21", 
    "Driving Drunk With Injury", "Driving In Bike Lane", "Driving On Left", 
    "Driving On Sidewalk", "Driving Over Centerline", "Driving Wrong Side", 
    "Duration/Signal", "Fail Stop/Sign", "Failure To Go On Green/Arrow", 
    "Fell Asleep", "Flashing Signal", "Follow Too Closely", "Going Wrong Way", 
    "Improper Turn", "Laws Applied To Bike", "Laws Apply To Motorized Scooter", 
    "Leave Accident Scene", "Minimum Speed Law", "Minor In Back Of Pick Up", 
    "No Head Lights", "Not Applicable", "Obey Traffic Control", 
    "Obstructed View/U-Turn", "Open Door Traffic", "Other Improper Driving", 
    "Overhead Crossing", "Parking On Railroad", "Parking Unlawfully", 
    "Pass", "Pass On Right", "Passing", "Passing On Left", "Pedestrian Don't Walk", 
    "Pedestrian On Roadway", "Pedestrian Yield Car", "Railroad Crossing", 
    "Reckless Driving", "Reckless Driving 1", "Reckless Driving 2", 
    "Regulation Of Turns At Intersection", "Right Of Way/Sidewalk", 
    "Run Red Light", "Speed Contest", "Speeding", "Spilling Load", 
    "Stop Suddenly", "Tire Tread Depth", "Towed Vehicle", "U-Turn Business District", 
    "Unattended Vehicle", "Unknown", "Unlawful Riding", "Unlicensed Driver", 
    "Unsafe Backing", "Unsafe Lane Change", "Unsafe Tow", "Unsafe Turn Movement", 
    "Unsafe U-Turn", "Vehicle Stopped/Pedestrian", "Vehicle Unsafe", 
    "Wrong Way Bike", "Yield Bike-Crosswalk", "Yield Emergency Vehicle", 
    "Yield For Pass", "Yield From Driveway/Curb", "Yield Left Turn", 
    "Yield Pedestrian In Crosswalk", "Yield Stop Sign", "Yield Uncontrolled Intersection", 
    "Yield Yield Sign"), class = "factor"), Month = structure(c(3L, 
    2L, 2L, 3L), .Label = c("1", "2", "3", "4", "5", "6", "7", 
    "8", "9", "10", "11", "12"), class = "factor"), Day = structure(c(1L, 
    2L, 10L, 3L), .Label = c("1", "2", "3", "4", "5", "6", "7", 
    "8", "9", "10", "11", "12", "13", "14", "15", "16", "17", 
    "18", "19", "20", "21", "22", "23", "24", "25", "26", "27", 
    "28", "29", "30", "31"), class = "factor"), Year = structure(c(8L, 
    6L, 6L, 6L), .Label = c("2011", "2012", "2013", "2014", "2015", 
    "2016", "2017", "2018", "2019", "2020"), class = "factor"), 
    CrashFatalities = c(FALSE, FALSE, FALSE, FALSE)), row.names = c(2L, 
3L, 4L, 7L), class = "data.frame")
Amyxdomz
  • Is this "traffic crash information" publicly available (i.e. where could we find it)? Or are you able to edit your question to include the output from the command `dput(head(df))`? Without a [minimal, reproducible example](https://stackoverflow.com/help/minimal-reproducible-example) it's difficult to answer your question. For further information, please see [How to make a great R reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) – jared_mamrot May 16 '22 at 23:25
  • It depends on the function/package you are using. If you were only fitting a glm, this would be done for you for character/categorical variables: `glm(mpg ~ factor(gear), mtcars)`. You can also do it yourself: `model.matrix( ~ factor(gear), mtcars)`. There are many questions on here about how to encode variables like this, but the answer will depend on the specific function you use – rawr May 17 '22 at 00:25
  • @jared_mamrot I just edited the post to include the first 4 rows of the data set using the how-to guide you provided. I hope I did that correctly, if not, please let me know and I will try again – Amyxdomz May 17 '22 at 00:44
  • @rawr I do plan on using glm() to create the logistic model – Amyxdomz May 17 '22 at 00:47
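As a sketch of the point raised in the comments, `glm()` with `family = binomial` expands factor predictors into indicator columns on its own when the model is fitted. The data frame `toy` below is made up, including the outcome probabilities, and only two of the question's column names are reused.

```r
set.seed(42)
n <- 500
toy <- data.frame(
  Age      = sample(16:90, n, replace = TRUE),
  Lighting = factor(sample(c("Daylight", "Dusk - Dawn", "Dark - Street Light"),
                           n, replace = TRUE))
)
# Made-up outcome; the probabilities are arbitrary.
p <- ifelse(toy$Lighting == "Daylight", 0.1, 0.3)
toy$CrashFatalities <- runif(n) < p

# Lighting is passed as a factor; glm() builds the indicator columns itself.
fit <- glm(CrashFatalities ~ Age + Lighting, data = toy, family = binomial)
print(names(coef(fit)))

# Predicted probability of a fatality for a new, hypothetical crash record.
new_row <- data.frame(
  Age      = 30L,
  Lighting = factor("Daylight", levels = levels(toy$Lighting))
)
pred <- predict(fit, newdata = new_row, type = "response")
print(pred)
```

The coefficient names show one term per non-reference level of `Lighting`, which is the treatment (dummy) coding R applies by default.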
