1

I'm facing a problem that I cannot fix. I have not a lot experience in R so any help will be welcome.

Okay so I have a variable called supermarkt2015_afst which is the distance to the supermarket. It goes from 0.1 towards 99.9. km. However the variable contains some MISSING values.

Since the fields are literally called "MISSING" and I don't want anything above 100km I clean it by;

datamodel2 <- datamodel2 %>% mutate(supermarkt2015_afst=ifelse(supermarkt2015_afst=="MISSING",NA,supermarkt2015_afst))

datamodel2 <- datamodel2 %>% mutate(supermarkt2015_afst=ifelse(supermarkt2015_afst>100,NA,supermarkt2015_afst))
datamodel2 <- datamodel2 %>% filter(!is.na(supermarkt2015_afst))

This cleans indeed the MISSING values and throws away roughly a third of my observations. Which is exactly what I wanted. HOWEVER. The variable now goes from 2-99. Not like 2.1 2.2 2.3, only round numbers. Where did it all go wrong? It says my variable is Integer when I run "typeof". What i want is have the same accuracy as before on 1 decimal. But for some reason it alters my variable as well.


One variable in my data looks like

dput(head(datamodel2$supermarkt2015_afst)) 

structure(c(3L, 3L, 3L, 3L, 3L, 3L), .Label = c("0,1", "0,2", "0,3", "0,4", "0,5",
 "0,6", "0,7", "0,8", "0,9", "1,0", "1,1", "1,2", "1,3", "1,4", 
"1,5", "1,6", "1,7", "1,8", "1,9", "10,0", .............. "6,4", "6,5", "6,6",
 "6,7", "6,8", "6,9", "7,0", "7,1", "7,2", "7,3", "7,4", "7,5", "7,6", "7,7", 
"7,8", "7,9", "8,0", "8,1", "8,2", "8,3", "8,6", "8,9", "9,0", "9,1", "9,2", 
"9,3", "9,6", "MISSING"), class = "factor")
user20650
  • 24,654
  • 5
  • 56
  • 91
Thundersheep
  • 45
  • 1
  • 8
  • You accidentally converted this field to a factor when you read it in, and the numbers have been replaced by the integer factor codes. How are you reading in your data? See the `na` argument to `readxl::read_excel` or `readr::read_csv` or `na.strings` for `read.csv` – Ben Bolker Oct 22 '17 at 16:16
  • # Read CSV into R cbsdata <- read.csv(file="data2015.csv", header=TRUE, sep=";") attach(cbsdata) – Thundersheep Oct 22 '17 at 16:18
  • Is that what you meant? And how can I fix it? Do you have any ideas? – Thundersheep Oct 22 '17 at 16:21
  • Can you post the output of `dput(head(cbsdata, 20))` in your question? – Rui Barradas Oct 22 '17 at 16:22
  • That is a lot to paste in. What are you looking for? – Thundersheep Oct 22 '17 at 16:27
  • I ran for the variable itself since the output was HUGE. This is before cleaning. Left some numbers out as it 2 long > dput(head(datamodel2$supermarkt2015_afst)) structure(c(3L, 3L, 3L, 3L, 3L, 3L), .Label = c("0,1", "0,2", "0,3", "0,4", "0,5", "0,6", "0,7", "0,8", "0,9", "1,0", "1,1", "1,2", "1,3", "1,4", "1,5", "1,6", "1,7", "1,8", "1,9", "10,0", .............. "6,4", "6,5", "6,6", "6,7", "6,8", "6,9", "7,0", "7,1", "7,2", "7,3", "7,4", "7,5", "7,6", "7,7", "7,8", "7,9", "8,0", "8,1", "8,2", "8,3", "8,6", "8,9", "9,0", "9,1", "9,2", "9,3", "9,6", "MISSING"), class = "factor") – Thundersheep Oct 22 '17 at 16:33
  • After cleaning it looks like. dput(head(datamodel2$supermarkt2015_afst)) c(3L, 3L, 3L, 3L, 3L, 3L) sorry for the posts, but doesnt fit in one. – Thundersheep Oct 22 '17 at 16:35
  • as Ben Bolker suggests, you could use the `na.strings` argument when reading in , so try `cbsdata <- read.csv(file="data2015.csv", header=TRUE, sep=";", na.strings=c("", "MISSING"), stringsAsFactors=FALSE)` – user20650 Oct 22 '17 at 16:35
  • I now did it load in as Ben Bolker suggested. And yes this replaces my "missing" with NA. But this does not solve the problem. Because now I can't clean my data. Because when I run datamodel2 <- datamodel2 %>% mutate(supermarkt2015_afst=ifelse(supermarkt2015_afst>100,NA,supermarkt2015_afst)) datamodel2 <- datamodel2 %>% filter(!is.na(supermarkt2015_afst)) I end up with 0 observations Warning message: In Ops.factor(supermarkt2015_afst, 100) : ‘>’ not meaningful for factors – Thundersheep Oct 22 '17 at 16:47
  • oh, I just noticed from your comment, your data is using commas for the decimal separator - so also use `dec=","` in `read.csv` – user20650 Oct 22 '17 at 16:52
  • However cleaning the NA's worked. I still have the decimals. It's just. I can't cut off all variables >100 cause factor.... – Thundersheep Oct 22 '17 at 16:52
  • I think that did the trick man. I'm going to run some checks to be sure first. – Thundersheep Oct 22 '17 at 16:58
  • Yeah it seems to work. You are a hero. And Ben thanks too and for everyone who have helped. One more thing though. I cannot see how many NA's i have per variable. this is important for me so i know how many observations i lose per adding variable. I run > sum(is.na(datamodel2$ziekenhuis_incl2015_afst)) [1] 94627 > sum(is.na(datamodel2$supermarkt2015_afst)) [1] 94627 but those are the total NA's. not per variable. Anyone? – Thundersheep Oct 22 '17 at 17:11
  • 1
    https://stackoverflow.com/questions/8317231/elegant-way-to-report-missing-values-in-a-data-frame – user20650 Oct 22 '17 at 17:12
  • `sapply(df, function(x) table(is.na(x))[["TRUE"]])` Number of NAs per column or variable. – Sowmya S. Manian Oct 22 '17 at 17:23
  • I get > sapply(datamodel2$supermarkt2015_afst, function(x) table(is.na(x))[["TRUE"]]) Error in table(is.na(x))[["TRUE"]] : subscript out of bounds > – Thundersheep Oct 22 '17 at 17:30
  • 1
    @Thundersheep ; you dont use `sapply` directly on a single variable, you use it on the dataset - it loops across each variable. So use `sapply(datamodel2, ...` . But please look at the link i gave there are better ways to do this. – user20650 Oct 22 '17 at 17:35
  • 1
    yes (e.g. `sapply(datamodel2, function(x) sum(is.na(x))`) – Ben Bolker Oct 22 '17 at 17:36
  • Yeah okay ran it like Ben said now. That works. But it gives the same output for all variables. Or well half of them say 0, so nothing missing and those who miss are all 67.195 – Thundersheep Oct 22 '17 at 17:43
  • Never mind think I found a solution. Anyway great thanks to you all! – Thundersheep Oct 22 '17 at 17:44

0 Answers0