How to Preprocess data to handle missing values in R

Question

I am trying to pre-process my data in R so that I can fill in missing values with the **attribute mean for all samples belonging to the same class as the given tuple**.

The data source provider has already coded missing or out-of-range values as -1. I want to replace those values according to the data mining principle stated above in bold. The column that decides the class is "Accident severity": each missing attribute value should get the mean of that attribute over all samples with the same accident severity level as the tuple in question.

As there are multiple columns with missing values, I guess I will have to repeat the task for each column, one at a time. What R command should I use?

There are mostly two data types in my data frame: the Date and Time columns are factors, whereas most of the other columns are integers.

Is there a way that I can upload a subset of the data set here on Stack Overflow?

Here is the link to the reproducible data set: https://drive.google.com/file/d/0B3cafW7J7xSfSkRTYWRWMHhaU2c/edit?usp=sharing

Update 2: Now that the data set is available, please help me change every "-1" in any of the columns to the mean of that column over all tuples that have the same value of "Accident_severity" as the tuple with the missing value.

Update 3: Please ignore the columns "X2_roadclass" and "X2_Road_type", as they are mostly blank and I am dropping them. Thanks.

You should have a look at [this](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) to see how to post data and ask a question properly, to maximize your chances of getting help here. — CHP, Mar 20 '14 at 10:10
It is much easier to help if you provide a **minimal, self contained example**. Please check these links for general ideas, and how to do it in R: [**here**](http://stackoverflow.com/help/mcve), [**here**](http://www.sscce.org/), and [**here**](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example/5963610#5963610). Also have a look at a nice [**checklist for questions on SO**](http://meta.stackexchange.com/questions/156810/stack-overflow-question-checklist). You should also show us the **code you have tried**. — Henrik, Mar 20 '14 at 10:11
I am trying to use the dput command. It created a 100 MB file even after I selected only 10% of the records. Also, when I try to read the same file using dget, the RStudio console gets stuck reading it for more than 20 minutes without any result. Should I take only 100 records and make a file using dput? — apTNow, Mar 20 '14 at 10:27
@apps92: Create a smaller subset of your data that can be pasted here and that still represents the problem. — CHP, Mar 20 '14 at 10:32
@apps92 Add a csv or txt file to a public folder using Dropbox or Google Docs. — Paulo E. Cardoso, Mar 20 '14 at 10:33
@apps92 The idea is to summarize all 20 variables for all [Accident_Severity] levels? — Paulo E. Cardoso, Mar 20 '14 at 11:03
Yes, sort of. But I don't only want to summarize: I want to replace the "-1" values with the means from that summarized output. — apTNow, Mar 20 '14 at 11:33

Paulo E. Cardoso · Answer 1 · 2014-03-20T11:17:55.943


Please see if this is close to what you need.

    library(ggplot2)  # loaded in the original answer; not used below
    library(reshape)  # melt()
    library(plyr)     # ddply()

Create some data

    set.seed(1)
    df <- data.frame(severity = rep(c('high', 'moderate', 'low'), each = 3),
                     factor1 = rep(c(1, 2, 3), each = 6),
                     factor2 = rep(c(4, 5, 6), times = 3),
                     date = rep(c('2011-01-01', '2011-01-03', '2011-01-10'),
                                times = 3), stringsAsFactors = FALSE)

Set a few values to -1

    df$factor2[3] <- -1
    df$factor1[1] <- -1

Replace them with NA

    df[df == -1] <- NA
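If the data frame mixes numeric columns with factor or character columns (as the Date and Time columns in the question do), it may be safer to recode the -1 sentinel in the numeric columns only. A minimal base R sketch, assuming a hypothetical `toy` data frame whose column names are purely illustrative:

```r
# Hypothetical toy data; column names are illustrative, not from the real data set
toy <- data.frame(severity = c("high", "low", "high"),
                  speed    = c(30, -1, 60),
                  lanes    = c(-1, 2, 2),
                  stringsAsFactors = FALSE)

# Find the numeric columns, then recode the -1 sentinel as NA in those only
num_cols <- sapply(toy, is.numeric)
toy[num_cols] <- lapply(toy[num_cols], function(x) replace(x, x %in% -1, NA))

toy
```

Using `%in%` instead of `==` keeps the logical index free of NA even when a column already contains missing values.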

Reshape it

    mdf <- melt(df, id.vars = c("severity", 'date'))
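To make the role of `id.vars` more concrete, here is a rough base R imitation of the reshaping step. It is only a sketch of the idea, not the reshape package's actual implementation, and the names `toy`, `id_vars`, `measure_vars`, and `long` are hypothetical:

```r
# Hypothetical toy data in "wide" form: one column per measured attribute
toy <- data.frame(severity = c("high", "low"),
                  factor1  = c(1, 2),
                  factor2  = c(4, 5),
                  stringsAsFactors = FALSE)

id_vars      <- "severity"                    # reference column(s), repeated as-is
measure_vars <- setdiff(names(toy), id_vars)  # every other column gets stacked

# Stack each measured column under the id columns: one row per
# (original row, measured column) pair
long <- do.call(rbind, lapply(measure_vars, function(v) {
  data.frame(toy[id_vars], variable = v, value = toy[[v]],
             stringsAsFactors = FALSE)
}))

long
```

The id columns are carried along unchanged, while every remaining column is folded into the `variable`/`value` pair. That is why grouping by `severity` and `variable` afterwards yields one mean per class and attribute.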

Summarize

    ddply(mdf, .(severity, variable), summarise, mean = mean(value, na.rm = TRUE))

      severity variable mean
    1     high  factor1  1.6
    2     high  factor2  4.8
    3      low  factor1  2.5
    4      low  factor2  5.0
    5 moderate  factor1  2.0
    6 moderate  factor2  5.0
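The same kind of per-class summary can also be computed without reshaping. A minimal base R sketch using `aggregate()`, assuming a hypothetical `toy` data frame in which the -1 values have already been recoded as NA:

```r
# Hypothetical toy data; NA marks values that were -1 in the raw data
toy <- data.frame(severity = c("high", "high", "low", "low"),
                  factor1  = c(1, NA, 3, 5),
                  factor2  = c(4, 6, NA, 8),
                  stringsAsFactors = FALSE)

# One row per severity level, one mean per measured column, ignoring NA
class_means <- aggregate(toy[c("factor1", "factor2")],
                         by  = list(severity = toy$severity),
                         FUN = mean, na.rm = TRUE)

class_means
```

This gives a wide table (one column per attribute) rather than the long table produced by `ddply`, which may be easier to read when there are only a few attributes.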

With the data provided, I'd do something like this

    dt <- read.csv('./Stackoverflow/datatry1.csv')

    # head(dt[ , -c(1:3) ])  # inspect the data without the first three (unwanted) columns
    # Note: -1 is not recoded to NA in dt here, so it still enters the means
    # (see X2nd_Road_Class in the output below).
    mdt <- melt(dt[ , -c(1:3) ], id.vars = c("Accident_Severity", 'Date',
                                             'Day_of_Week', 'Time'))
    dts <- ddply(mdt, .(Accident_Severity, variable), summarise,
                 mean = mean(value, na.rm = TRUE))
    dts

       Accident_Severity                   variable         mean
    1                  1         Number_of_Vehicles   1.00000000
    2                  1            X1st_Road_Class   3.00000000
    3                  1           X1st_Road_Number 503.00000000
    4                  1                  Road_Type   6.00000000
    5                  1                Speed_limit  30.00000000
    6                  1            Junction_Detail   3.00000000
    7                  1            X2nd_Road_Class  -1.00000000
    ...

edited Mar 20 '14 at 11:17

answered Mar 20 '14 at 10:54


I went through the documentation of the reshape package on CRAN but still don't understand how the melt command works and what exactly id.vars does. – apTNow Mar 20 '14 at 11:39
@apps92 In short, id.vars can be thought of as your reference variables. All other columns are measured variables, always relative to your id.vars. – Paulo E. Cardoso Mar 20 '14 at 11:46
Is it necessary to add other columns to id.vars? I just want to judge based on Accident_severity. – apTNow Mar 20 '14 at 11:56
@apps92 I think so. The idea is to get mean summaries for each Accident_Severity. Did you check the code above with your data? Is it giving the expected results? – Paulo E. Cardoso Mar 20 '14 at 12:09
Yes, I tried the code above with my data. I didn't omit the first 3 columns, but they still never appeared in the output; I wonder why. Apart from that, the output looks great. But this is only halfway, I guess. Do I now assign the mean values in place of "-1" for all these 30+ output results, or is there a programmatic way? – apTNow Mar 20 '14 at 12:26
@apps92 I'm not following you. Before the calculations, to deal with -1, we replaced them with NAs. Is the idea to replace -1 in the original data.frame with the mean values? – Paulo E. Cardoso Mar 20 '14 at 12:35
No, in the actual data I didn't replace them with NA. And yes, I have to replace each "-1" in the original data with the mean of that attribute for the particular accident severity. – apTNow Mar 20 '14 at 12:40
I am trying the following command, but I'm not sure if it's correct: acc3$Road_Type[which(acc3$Road_Type==-1)]<-dts$mean[which(dts$Accident_Severity==acc3$Accident_Severity&dts$variable="Road_Type")] – apTNow Mar 20 '14 at 12:49
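
The command in the last comment uses `=` where `==` is needed, and it compares `dts$Accident_Severity` with a column of a different length, so it is unlikely to work as written. One possible way to do the replacement directly is base R's `ave()`, which returns the group mean aligned with each row. This is a minimal sketch on a hypothetical `acc` data frame that stands in for the real data; the values and the `cols` vector are assumptions for illustration:

```r
# Hypothetical toy data standing in for the accident data set
acc <- data.frame(Accident_Severity = c(1, 1, 1, 2, 2, 2),
                  Road_Type         = c(6, -1, 2, 3, 3, -1),
                  Speed_limit       = c(30, 50, -1, 60, -1, 40))

cols <- c("Road_Type", "Speed_limit")  # columns to impute (assumed list)

for (col in cols) {
  x <- acc[[col]]
  x[x %in% -1] <- NA                   # recode the -1 sentinel as NA
  # Mean of the column within each severity level, one value per row
  grp_mean <- ave(x, acc$Accident_Severity,
                  FUN = function(v) mean(v, na.rm = TRUE))
  x[is.na(x)] <- grp_mean[is.na(x)]    # fill gaps with the class mean
  acc[[col]] <- x
}

acc
```

Two caveats: integer columns become doubles once a mean is written into them, and a severity level in which every value of a column is missing yields NaN for that column rather than a usable mean.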
