I have the data frame below, in which all the columns are factors, and I want to use them as numeric columns. I tried several approaches, but the values change to different numbers when I use as.numeric(as.character(.)).

The data comes in a semicolon separated format. A subset of data to reproduce the problem is:

rawData <- "Date;Time;Global_active_power;Global_reactive_power;Voltage;Global_intensity;Sub_metering_1;Sub_metering_2;Sub_metering_3
21/12/2006;11:23:00;?;?;?;?;?;?;
21/12/2006;11:24:00;?;?;?;?;?;?;
16/12/2006;17:24:00;4.216;0.418;234.840;18.400;0.000;1.000;17.000
16/12/2006;17:25:00;5.360;0.436;233.630;23.000;0.000;1.000;16.000
16/12/2006;17:26:00;5.374;0.498;233.290;23.000;0.000;2.000;17.000
16/12/2006;17:27:00;5.388;0.502;233.740;23.000;0.000;1.000;17.000
16/12/2006;17:28:00;3.666;0.528;235.680;15.800;0.000;1.000;17.000
16/12/2006;17:29:00;3.520;0.522;235.020;15.000;0.000;2.000;17.000
16/12/2006;17:30:00;3.702;0.520;235.090;15.800;0.000;1.000;17.000
16/12/2006;17:31:00;3.700;0.520;235.220;15.800;0.000;1.000;17.000
16/12/2006;17:32:00;3.668;0.510;233.990;15.800;0.000;1.000;17.000
"
hpc <- read.csv(text=rawData,sep=";")
str(hpc)

When run against the full data file after dropping the date and time variables, the output from str() looks like:

> str(hpc)
'data.frame':   2075259 obs. of  7 variables:
 $ Global_active_power  : Factor w/ 4187 levels "?","0.076","0.078",..: 2082 2654 2661 2668 1807 1734 1825 1824 1808 1805 ...
 $ Global_reactive_power: Factor w/ 533 levels "?","0.000","0.046",..: 189 198 229 231 244 241 240 240 235 235 ...
 $ Voltage              : Factor w/ 2838 levels "?","223.200",..: 992 871 837 882 1076 1010 1017 1030 907 894 ...
 $ Global_intensity     : Factor w/ 222 levels "?","0.200","0.400",..: 53 81 81 81 40 36 40 40 40 40 ...
 $ Sub_metering_1       : Factor w/ 89 levels "?","0.000","1.000",..: 2 2 2 2 2 2 2 2 2 2 ...
 $ Sub_metering_2       : Factor w/ 82 levels "?","0.000","1.000",..: 3 3 14 3 3 14 3 3 3 14 ...
 $ Sub_metering_3       : num  17 16 17 17 17 17 17 17 17 16 ...

Can anyone help me in getting the expected output?

expected output:

 > str(hpc)
'data.frame':   2075259 obs. of  7 variables:
 $ Global_active_power  : num  "?","0.076","0.078",..: 2082 2654 2661 2668 1807 1734 1825 1824 1808 1805 ...
 $ Global_reactive_power: num  "?","0.000","0.046",..: 189 198 229 231 244 241 240 240 235 235 ...
 $ Voltage              : num  "?","223.200",..: 992 871 837 882 1076 1010 1017 1030 907 894 ...
 $ Global_intensity     : num  "?","0.200","0.400",..: 53 81 81 81 40 36 40 40 40 40 ...
 $ Sub_metering_1       : num  "?","0.000","1.000",..: 2 2 2 2 2 2 2 2 2 2 ...
 $ Sub_metering_2       : num  "?","0.000","1.000",..: 3 3 14 3 3 14 3 3 3 14 ...
 $ Sub_metering_3       : num  17 16 17 17 17 17 17 17 17 16 ...
Rui Barradas
vinay karagod
  • @RonakShah not helping it is changing to different values – vinay karagod Jan 03 '18 at 02:29
  • oops..sorry try `lapply(hpc, function(x) as.numeric(as.character(x)))` – Ronak Shah Jan 03 '18 at 02:31
  • @RonakShah this is not a duplicate question as I already tried with that its not working – vinay karagod Jan 03 '18 at 02:35
  • yup..I thought so..but you need to provide the `dput` in such cases to get what kind of data you have. – Ronak Shah Jan 03 '18 at 02:36
  • This question is better asked as "how do I correctly read data of various data types from a CSV file?" If you read the data with the correct types in the first place, they don't need to be converted from factor to numeric. – Len Greski Jan 03 '18 at 03:36
  • @RonakShah / @akrun - The "root cause" of the problem in the OP is the fact that the missing values are `?` in this data set, which causes `read.csv()` to parse all columns as factors because the default for the `stringsAsFactors=` argument is `TRUE`. Would one of you reopen the question so I can post an answer that addresses the root cause of why these columns are parsed as factors? – Len Greski Jan 03 '18 at 12:46
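The root cause described in the last comment can be sketched with base R alone. This is a minimal illustration, assuming that `?` is the only missing-value marker in the file; the sample is trimmed to two measurement columns for brevity, and `na.strings=` is a documented argument of `read.csv()`.

```r
# Trimmed sample of the data from the question (two measurement columns only).
rawData <- "Date;Time;Global_active_power;Voltage
21/12/2006;11:23:00;?;?
16/12/2006;17:24:00;4.216;234.840
16/12/2006;17:25:00;5.360;233.630"

# Declaring "?" as a missing-value marker lets read.csv() see the measurement
# columns as numeric, so no factor-to-numeric conversion is needed afterwards.
hpc <- read.csv(text = rawData, sep = ";", na.strings = "?",
                stringsAsFactors = FALSE)
str(hpc)
```

With the `?` entries read as `NA`, the measurement columns come back as `num`, while Date and Time stay as character strings.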

1 Answer

I was not able to test this on your data frame, but hopefully it will work. I notice that in the output of str(hpc) not all columns are factors. mutate_if applies a function only to the columns that satisfy a predicate function (here, is.factor).

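library(dplyr)

# Convert only the factor columns. Going through as.character() parses the
# labels rather than the internal level codes; "?" becomes NA with a warning.
# funs() is deprecated since dplyr 0.8.0, so a formula lambda is used instead.
hpc2 <- hpc %>%
    mutate_if(is.factor, ~ as.numeric(as.character(.)))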
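To see why the intermediate as.character() step matters, here is a small sketch on a toy factor. The vector `f` is made up for illustration, with values borrowed from the sample rows; it is not the real column.

```r
# Toy factor mimicking one column of the data; "?" marks a missing value.
f <- factor(c("?", "4.216", "5.360", "4.216"))

# as.numeric() applied directly to a factor returns the internal level codes,
# which is why the values appear to "change".
codes <- as.numeric(f)

# Converting to character first recovers the labels. "?" cannot be parsed as
# a number, so it becomes NA (R normally warns about this; silenced here).
values <- suppressWarnings(as.numeric(as.character(f)))

print(codes)
print(values)
```

The codes depend on how the levels are ordered, which can vary with locale, whereas `values` holds the actual readings with `NA` in place of `?`.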
www
  • This is another Johns Hopkins Data Science Specialization assignment. I didn't downvote, but a better answer for this specific question is to read the data with the correct column types with `readr::read_csv()` or `read_csv2()` using the `col_types=` argument, given the data source. Of course, since the OP didn't post the original data, you don't have this context. – Len Greski Jan 03 '18 at 03:33
  • @LenGreski Thanks for your comments. – www Jan 03 '18 at 03:51
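The `col_types=` suggestion from the comment above could look roughly like the sketch below. It rests on some assumptions: the readr package is installed, `?` and empty fields are the only missing-value markers, and the sample is trimmed to two measurement columns. It uses `read_delim()` with `delim = ";"` rather than `read_csv2()`, because `read_csv2()` expects a comma as the decimal mark, whereas this data uses a decimal point.

```r
library(readr)

# Trimmed sample of the data from the question (two measurement columns only).
rawData <- "Date;Time;Global_active_power;Voltage
21/12/2006;11:23:00;?;?
16/12/2006;17:24:00;4.216;234.840
16/12/2006;17:25:00;5.360;233.630"

# Write the sample to a temporary file so the call mirrors reading a real file.
tmp <- tempfile(fileext = ".txt")
writeLines(rawData, tmp)

# Declare the column types up front: Date and Time stay as text, and every
# other column is parsed as double, with "?" and empty fields read as NA.
hpc <- read_delim(
  tmp,
  delim = ";",
  na = c("?", ""),
  col_types = cols(
    Date = col_character(),
    Time = col_character(),
    .default = col_double()
  )
)
str(hpc)
```

Declaring the types when reading means a malformed value is reported as a parsing problem instead of silently turning the whole column into text.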