I am using R and have a data set of traffic crash information with the columns listed below. In all, there are 25 variables and 69,523 observations. Note that the variable CrashFatalities was generated by the following line of code:
df$CrashFatalities <- ifelse(df$FatalInjuries > 0, TRUE, FALSE)
Ultimately, my goal is to create a logistic regression model using CrashFatalities as the target variable to predict the likelihood that a car accident will result in at least one fatality.
Because the data set mixes numeric and categorical variables, my plan is to apply FAMD (factor analysis of mixed data) to reduce the number of variables used in the logistic regression model.
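To make the FAMD step concrete, here is a minimal sketch on made-up data rather than the real crash data. It assumes the FactoMineR package, which is not named above; its FAMD() function takes a data frame of numeric and factor columns as they are. The toy data frame, the seed, and ncp = 3 are illustrative choices only.

```r
# Simulated stand-in for a few of the crash columns (NOT the real data)
set.seed(42)
n <- 200
toy <- data.frame(
  Age           = sample(18:80, n, replace = TRUE),
  MinorInjuries = rpois(n, 0.5),
  Sex           = factor(sample(c("F", "M", "Unknown"), n, replace = TRUE)),
  Weather       = factor(sample(c("Clear", "Cloudy", "Rain"), n, replace = TRUE))
)
# The target (CrashFatalities) is deliberately left out of the FAMD input.

# Assumption: FactoMineR is installed; the block is skipped otherwise
if (requireNamespace("FactoMineR", quietly = TRUE)) {
  # Factors are passed in unchanged; ncp is the number of dimensions to keep
  res <- FactoMineR::FAMD(toy, ncp = 3, graph = FALSE)

  # Coordinates of each row on the retained dimensions; these could serve
  # as the reduced set of predictors in a later model
  scores <- as.data.frame(res$ind$coord)
  print(head(scores))

  # Eigenvalues and share of variance per dimension, useful for choosing ncp
  print(res$eig)
}
```

How many dimensions to keep is a judgment call; the eigenvalue table is one place to start.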
My question is: should one-hot encoding be used on the categorical/factor variables, and if so, at which point in the analysis (i.e. before applying FAMD, after it, or not at all)? I am worried that if I one-hot encode the categorical variables first, I will run into memory issues from adding so many variables to the data set.
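As a rough check on the memory concern, the number of indicator columns can be computed from the level counts in the str() output in this post. The sketch below assumes a dense numeric matrix at 8 bytes per value; the object names (n_levels, full_onehot_cols and so on) are made up for illustration.

```r
# Number of levels per factor, copied from the str() output in this post
n_levels <- c(
  PedestrianAction = 11, RoadwaySurface = 5, RoadwayCondition = 9,
  Lighting = 6, PrimaryCollisionFactor = 9, TrafficControl = 5,
  Weather = 8, CollisionType = 9, ProximityToIntersection = 4,
  VehicleInvolvedWith = 18, Sex = 3, VehicleDamage = 7, Sobriety = 9,
  MovementPrecedingCollision = 22, PartyType = 20,
  ViolationCodeDescription = 80, Month = 12, Day = 31, Year = 10
)
n_rows <- 69523

full_onehot_cols <- sum(n_levels)      # one indicator column per level
treatment_cols   <- sum(n_levels - 1)  # one reference level dropped per factor

# Approximate size of a dense double matrix (8 bytes per cell), in MiB
dense_mib <- n_rows * full_onehot_cols * 8 / 1024^2

full_onehot_cols  # 278
treatment_cols    # 259
round(dense_mib)  # 147
```

On these numbers, a fully expanded dense matrix would be on the order of 150 MB, not counting any copies made during modelling.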
And yes, I understand that there may still be some data cleaning to do, such as creating a dummy variable for CrashFatalities and likely removing FatalInjuries from the list of features in the logistic model, as it is no longer independent of the target. I have been unsuccessful in finding much documentation on FAMD, and this will be my first attempt at applying it to any dataset, so I am really hung up on how to proceed from here. I am fairly new to the data analytics/science world and am using this project as practice and a learning experience, so I apologise if I am leaving out any pertinent information. I am more than happy to elaborate on anything if necessary.
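The cleaning steps mentioned above could look roughly like this in base R. The data is simulated, not the crash data, and the object names toy and model_df are made up for the example. glm() with family = binomial accepts a logical response (TRUE is modelled as the event), so this sketch does not build a separate dummy column for the target.

```r
# Simulated stand-in for the crash data (NOT the real data)
set.seed(1)
n <- 500
toy <- data.frame(
  Age = sample(18:80, n, replace = TRUE),
  Sex = factor(sample(c("F", "M", "Unknown"), n, replace = TRUE))
)
# Fatal injuries made loosely dependent on Age, purely for illustration
toy$FatalInjuries   <- rbinom(n, 1, plogis(-3 + 0.03 * toy$Age))
toy$CrashFatalities <- toy$FatalInjuries > 0

# Drop the column the target was derived from, so it cannot leak into the model
model_df <- toy[, setdiff(names(toy), "FatalInjuries")]

# Logistic regression on the logical target; factors are dummy-coded by glm()
fit <- glm(CrashFatalities ~ ., data = model_df, family = binomial)

# Fitted probabilities of at least one fatality
pred <- predict(fit, type = "response")
summary(pred)
```

Leaving FatalInjuries in would let the model reproduce the target exactly, which is why it is removed before fitting.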
Structure of the data (output of str(df)):
$ MinorInjuries : int 0 0 0 2 1 0 0 1 1 0 ...
$ ModerateInjuries : int 0 0 1 0 0 0 0 0 0 0 ...
$ SevereInjuries : int 0 0 0 0 0 0 0 0 0 0 ...
$ FatalInjuries : int 0 0 0 0 0 0 0 0 0 0 ...
$ PedestrianAction : Factor w/ 11 levels "Approaching/Leaving School Bus",..: 6 6 6 2 6
$ RoadwaySurface : Factor w/ 5 levels "Dry","Slippery (Muddy Oily etc.)",..: 5 5 1 1 1
$ RoadwayCondition : Factor w/ 9 levels "Construction - Repair Zone",..: 5 5 5 5 5 5 5 5
$ Lighting : Factor w/ 6 levels "Dark - No Street Light",..: 4 4 2 2 4 2 2 2 2 2
$ PrimaryCollisionFactor : Factor w/ 9 levels "Bike At Fault",..: 8 8 8 8 8 7 7 8 8 8 ...
$ TrafficControl : Factor w/ 5 levels "Controls Functioning",..: 4 4 4 4 4 1 1 1 1 4
$ Weather : Factor w/ 8 levels "Clear","Cloudy",..: 5 5 1 2 1 1 5 1 1 1 ...
$ CollisionType : Factor w/ 9 levels "Broadside","Head On",..: 3 5 2 9 7 2 6 6 6 4
$ ProximityToIntersection : Factor w/ 4 levels "Driveway","Intersection",..: 3 3 3 2 3 4 4 2 2
$ VehicleInvolvedWith : Factor w/ 18 levels "Animal","Bike",..: 3 3 3 12 3 10 10 10 10 11
$ Sex : Factor w/ 3 levels "F","M","Unknown": 1 1 2 2 2 2 2 2 2 2 ...
$ Age : int 24 25 66 50 31 73 32 29 30 33 ...
$ VehicleDamage : Factor w/ 7 levels "Major","Minor",..: 1 6 1 2 1 7 7 1 3 2 ...
$ Sobriety : Factor w/ 9 levels "Had Been Drinking - Impairment Unknown",..: 4 4
$ MovementPrecedingCollision: Factor w/ 22 levels "Backing","Changing Lanes",..: 2 17 18 5 16 20
$ PartyType : Factor w/ 20 levels "Bicycle","Bus - Other",..: 4 4 4 4 4 4 4 4 4 4
$ ViolationCodeDescription : Factor w/ 80 levels "Bad Brakes","Bald Tires",..: 54 54 54 54 54 61
$ Month : Factor w/ 12 levels "1","2","3","4",..: 3 2 2 3 5 2 12 2 2 12 ...
$ Day : Factor w/ 31 levels "1","2","3","4",..: 1 2 10 3 25 26 19 23 23 20
$ Year : Factor w/ 10 levels "2011","2012",..: 8 6 6 6 7 3 4 3 3 3 ...
$ CrashFatalities : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
First 4 rows of data:
structure(list(MinorInjuries = c(0L, 0L, 0L, 2L), ModerateInjuries = c(0L,
0L, 1L, 0L), SevereInjuries = c(0L, 0L, 0L, 0L), FatalInjuries = c(0L,
0L, 0L, 0L), PedestrianAction = structure(c(6L, 6L, 6L, 2L), .Label = c("Approaching/Leaving School Bus",
"Crossing - Not In Crosswalk", "Crossing In Crosswalk - At Intersection",
"Crossing In Crosswalk - Not At Intersection", "In Road - Includes Shoulder",
"No Pedestrians Involved", "Not In Road", "Other", "Running/Jogging",
"Unknown", "Walking"), class = "factor"), RoadwaySurface = structure(c(5L,
5L, 1L, 1L), .Label = c("Dry", "Slippery (Muddy Oily etc.)",
"Snowy - Icy", "Unknown", "Wet"), class = "factor"), RoadwayCondition = structure(c(5L,
5L, 5L, 5L), .Label = c("Construction - Repair Zone", "Flooded",
"Holes Deep Rut", "Loose Material On Roadway", "No Unusual Conditions",
"Obstruction On Roadway", "Other", "Reduced Roadway Width", "Unknown"
), class = "factor"), Lighting = structure(c(4L, 4L, 2L, 2L), .Label = c("Dark - No Street Light",
"Dark - Street Light", "Dark - Street Light Not Functioning",
"Daylight", "Dusk - Dawn", "Unknown"), class = "factor"), PrimaryCollisionFactor = structure(c(8L,
8L, 8L, 8L), .Label = c("Bike At Fault", "Fell Asleep", "Other Improper Driving",
"Other Than Driver", "Parked/Rolling", "Pedestrian At Fault",
"Unknown", "Violation Driver 1", "Violation Driver 2"), class = "factor"),
TrafficControl = structure(c(4L, 4L, 4L, 4L), .Label = c("Controls Functioning",
"Controls Not Functioning", "Controls Obscured", "No Controls Present/Factor",
"Unknown"), class = "factor"), Weather = structure(c(5L,
5L, 1L, 2L), .Label = c("Clear", "Cloudy", "Fog", "Other",
"Rain", "Snow", "Unknown", "Wind"), class = "factor"), CollisionType = structure(c(3L,
5L, 2L, 9L), .Label = c("Broadside", "Head On", "Hit Object",
"Other", "Overturned", "Rear End", "Sideswipe", "Vehicle/Bike",
"Vehicle/Pedestrian"), class = "factor"), ProximityToIntersection = structure(c(3L,
3L, 3L, 2L), .Label = c("Driveway", "Intersection", "Non-Related",
"Related"), class = "factor"), VehicleInvolvedWith = structure(c(3L,
3L, 3L, 12L), .Label = c("Animal", "Bike", "Fixed Object",
"Ice Cream Truck", "Light Rail Vehicle", "Motor Vehicle On Other Roadway",
"Motorcycle", "Non-Collision", "Other Object", "Other Vehicle",
"Parked Vehicle", "Pedestrian", "Scooter Motorized", "Scooter Non-Motorized",
"Skateboard", "Train", "Unknown", "Wheelchair"), class = "factor"),
Sex = structure(c(1L, 1L, 2L, 2L), .Label = c("F", "M", "Unknown"
), class = "factor"), Age = c(24L, 25L, 66L, 50L), VehicleDamage = structure(c(1L,
6L, 1L, 2L), .Label = c("Major", "Minor", "Moderate", "None",
"Not Applicable", "Totaled", "Unknown"), class = "factor"),
Sobriety = structure(c(4L, 4L, 3L, 4L), .Label = c("Had Been Drinking - Impairment Unknown",
"Had Been Drinking - Not Under Influence", "Had Been Drinking - Under Influence",
"Had Not Been Drinking", "Impairment Not Known", "Impairment Physical",
"Not Applicable", "Sleepy/Fatigued", "Under Drug Influence"
), class = "factor"), MovementPrecedingCollision = structure(c(2L,
17L, 18L, 5L), .Label = c("Backing", "Changing Lanes", "Crossing Into Opposing Lane",
"Entering Traffic", "Making Left Turn", "Making Right Turn",
"Making U-Turn", "Merging", "Other", "Other (Bike)", "Other (Ped)",
"Other Unsafe Turning", "Parked", "Parking Maneuver", "Passing Other Vehicles", "Proceeding Straight", "Ran Off Road", "Slowing/Stopping",
"Stalled", "Stopped", "Traveling Wrong Way", "Unknown"), class = "factor"),
PartyType = structure(c(4L, 4L, 4L, 4L), .Label = c("Bicycle",
"Bus - Other", "Bus - School", "Car", "Car With Trailer",
"Construction Equipment", "Emergency Vehicle", "Ice Cream Truck",
"Light Rail Vehicle", "Motorcycle/Moped", "Other", "Panel Truck",
"Pedestrian", "Scooter Motorized", "Scooter Non-Motorized",
"Semi Truck", "Skateboard", "Train", "Unknown", "Wheelchair"
), class = "factor"), ViolationCodeDescription = structure(c(54L,
54L, 54L, 54L), .Label = c("Bad Brakes", "Bald Tires", "Bike Over Centerline",
"Child Safety/Belt", "Condition/Brakes", "Crossing Controlled Intersection/Jaywalking",
"Curb Parking", "Divided Highway", "Driving Drunk", "Driving Drunk <21",
"Driving Drunk With Injury", "Driving In Bike Lane", "Driving On Left",
"Driving On Sidewalk", "Driving Over Centerline", "Driving Wrong Side",
"Duration/Signal", "Fail Stop/Sign", "Failure To Go On Green/Arrow",
"Fell Asleep", "Flashing Signal", "Follow Too Closely", "Going Wrong Way",
"Improper Turn", "Laws Applied To Bike", "Laws Apply To Motorized Scooter",
"Leave Accident Scene", "Minimum Speed Law", "Minor In Back Of Pick Up",
"No Head Lights", "Not Applicable", "Obey Traffic Control",
"Obstructed View/U-Turn", "Open Door Traffic", "Other Improper Driving",
"Overhead Crossing", "Parking On Railroad", "Parking Unlawfully",
"Pass", "Pass On Right", "Passing", "Passing On Left", "Pedestrian Don't Walk",
"Pedestrian On Roadway", "Pedestrian Yield Car", "Railroad Crossing",
"Reckless Driving", "Reckless Driving 1", "Reckless Driving 2",
"Regulation Of Turns At Intersection", "Right Of Way/Sidewalk",
"Run Red Light", "Speed Contest", "Speeding", "Spilling Load",
"Stop Suddenly", "Tire Tread Depth", "Towed Vehicle", "U-Turn Business District",
"Unattended Vehicle", "Unknown", "Unlawful Riding", "Unlicensed Driver",
"Unsafe Backing", "Unsafe Lane Change", "Unsafe Tow", "Unsafe Turn Movement",
"Unsafe U-Turn", "Vehicle Stopped/Pedestrian", "Vehicle Unsafe",
"Wrong Way Bike", "Yield Bike-Crosswalk", "Yield Emergency Vehicle",
"Yield For Pass", "Yield From Driveway/Curb", "Yield Left Turn",
"Yield Pedestrian In Crosswalk", "Yield Stop Sign", "Yield Uncontrolled Intersection",
"Yield Yield Sign"), class = "factor"), Month = structure(c(3L,
2L, 2L, 3L), .Label = c("1", "2", "3", "4", "5", "6", "7",
"8", "9", "10", "11", "12"), class = "factor"), Day = structure(c(1L,
2L, 10L, 3L), .Label = c("1", "2", "3", "4", "5", "6", "7",
"8", "9", "10", "11", "12", "13", "14", "15", "16", "17",
"18", "19", "20", "21", "22", "23", "24", "25", "26", "27",
"28", "29", "30", "31"), class = "factor"), Year = structure(c(8L,
6L, 6L, 6L), .Label = c("2011", "2012", "2013", "2014", "2015",
"2016", "2017", "2018", "2019", "2020"), class = "factor"),
CrashFatalities = c(FALSE, FALSE, FALSE, FALSE)), row.names = c(2L,
3L, 4L, 7L), class = "data.frame")