0

I am a beginner data science student in health sciences. I am attempting to clean my dataset before utilizing it for analysis.

I have beginner experience in R and need some assistance in converting a string to a numeric value so I can conduct analysis on the variable.

In the publicly available data, there is a character variable that records people's perception of the health care system on a Likert scale, but it is coded in the dataset as "1 - terrible; 2; 3; 4;... 10 - Excellent".

All I want to do is:
1) Convert "1 - terrible" to just "1", and the same for 10.
2) Omit all the "Don't know/refused" responses, so that they are removed from my denominator.

I did some initial searching and found some functions (strsplit), but I'm having difficulty applying them to my situation.

  • Hi, and welcome to SO! Please take a look at [how to ask](https://stackoverflow.com/help/how-to-ask) and also how to provide a good [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) as this makes providing answers much easier. In this case, if you have tried using `strsplit` but did not succeed, what was the error? – Calum You Jan 22 '19 at 00:23
  • It will help if you provide a few rows of your dataset, showing all values that the response can take. Or point us to data online, since it is publicly-available. – neilfws Jan 22 '19 at 00:27
  • Thank you for the welcome and the response. I have not coded anything yet, as I am not really sure how to use strsplit. I am using R Markdown, and usually, before I code, I read up on a piece of code until I understand it and then apply it. However, I am failing to understand this function and how to use it. I have loaded my dataset and created summaries and tables for each variable, but cleaning is where I get stuck. Any advice or help is appreciated. – helpmeimagradstudent Jan 22 '19 at 00:28
  • neilfws, thank you, good point. It is located here: https://open.canada.ca/data/dataset/3eac6c30-4e06-4441-a84b-8019786ae69c The variables are Q2 and Q3, which I am trying to convert from character to numeric. – helpmeimagradstudent Jan 22 '19 at 00:30

5 Answers

1

Welcome to SO! You should check out this Help page with a few hints on how to make your questions easier to answer. Notably, you should provide a proper example. It can be daunting, but if you managed to find strsplit then you are clearly capable of digging deeper. I'd advise you to go for one of the very accessible free intros to R.

# This is the bare minimum you should provide us with

likert <- c("1 - terrible", "2 - bad", 
            "3 - average", "4 - good", "5 - excellent", "Don't know")


# This seems to be what you're attempting
library(stringr)

likert_numeric <- as.numeric(str_extract(string = likert, pattern = "\\d"))
# str_extract returns the first match of the pattern in each string, still as a string
# (NA where nothing matches, as for "Don't know")
# \\d matches a single digit; use \\d+ if a value can have more than one digit (e.g. 10)

likert_numeric
#> [1]  1  2  3  4  5 NA

# But perhaps you just want to code the variable as a factor, 
# which will tell R to treat it appropriately in statistical settings
likert_factor <- as.factor(likert)

likert_factor
#> [1] 1 - terrible  2 - bad       3 - average   4 - good      5 - excellent
#> [6] Don't know
#> Levels: 1 - terrible 2 - bad 3 - average 4 - good 5 - excellent Don't know

You may want to play around with the numeric version just to get some quick and dirty results; but in the long run, you want to know what factors are and how to use them.

EDIT: As to ignoring the NA value, you'll need to tell us what you're trying to do. Many functions in R have an argument to ignore NA values (na.rm = TRUE), but it may or may not be suitable.
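
As a minimal sketch of the na.rm idea (the vector below is made up for illustration, not taken from your data), the NA that comes from "Don't know" drops out of the denominator only when you ask for it:

```r
# Hypothetical responses already converted to numbers; NA stands for "Don't know"
likert_numeric <- c(1, 2, 3, 4, 5, NA)

# Without na.rm, the missing value propagates to the result
mean(likert_numeric)
#> [1] NA

# With na.rm = TRUE, the NA is dropped and the mean uses the 5 valid answers
mean(likert_numeric, na.rm = TRUE)
#> [1] 3

# Number of valid (non-missing) answers, i.e. the denominator
sum(!is.na(likert_numeric))
#> [1] 5
```

Whether dropping the NA is appropriate still depends on the analysis, so treat this as a starting point only.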

Fons MA
0
df$yourcol <- as.integer(gsub("\\D", "", df$yourcol))
iod
0

A minor modification to @FonsMA's answer, since it would truncate two-digit values (i.e. 10 becomes 1). The following should help.

txt <- data.frame(character = c("1 - terrible", "2 - awful", "3 - bad", "4 - not good",
                                "5 - umm", "6 - OK", "7 - good", "8 - great",
                                "9 - fantastic", "10-excellent"),
                  code = 0, stringsAsFactors = FALSE)

library(stringr)
# "[0-9]+" requires at least one digit, so strings without digits give NA rather than ""
txt$code <- as.numeric(str_extract(string = txt$character, pattern = "[0-9]+"))
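
To make the difference between the two patterns concrete, here is a small sketch. The vector x is made up for illustration, and the multi-digit pattern is written as "[0-9]+", which requires at least one digit:

```r
library(stringr)

# Hypothetical responses, including a two-digit value and a non-response
x <- c("1 - terrible", "10 - Excellent", "Don't know/Refuse")

# A single-digit pattern stops after the first digit, so "10" is cut to "1"
one_digit <- str_extract(x, "\\d")
one_digit
#> [1] "1" "1" NA

# A one-or-more pattern keeps the whole number
all_digits <- str_extract(x, "[0-9]+")
all_digits
#> [1] "1"  "10" NA

# Converting to numeric keeps the NA for the non-response
codes <- as.numeric(all_digits)
```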

For your actual use case, I would simply create the extra variable in your data frame and then use str_extract.

You could do something like:

YOURDATAFRAME$newCol <- 0
YOURDATAFRAME$newCol <- as.numeric(str_extract(string = YOURDATAFRAME$STRCOL, pattern = "[0-9]+"))
Nick
0

If you want to do "things with data frames", it's worth getting to know dplyr.

You can grab the dataset straight from the Web:

library(readr)
library(dplyr)

cdn_attitudes <- read_csv("http://www.hc-sc.gc.ca/data-donnees/por-rop/cdn-attitudes-healthcare_attitudes-canadiens-system-soins.csv")

Some examples. You can use filter to remove rows where, for example, Q2 is "Don't know/Refuse":

cdn_attitudes %>%
  filter(Q2 != "Don't know/Refuse")

You can use mutate with gsub and as.numeric to remove anything "not a digit" and convert to numbers:

cdn_attitudes %>%
  mutate(Q2 = gsub("\\D+", "", Q2)) %>%
  mutate(Q2 = as.numeric(Q2))

Now to get more complicated. We can use filter_at to filter on more than one column, and mutate_at to mutate values in more than one column, at the same time.

So to filter rows on both Q2 and Q3, then convert to numeric:

cdn_attitudes %>% 
  filter_at(vars(Q2, Q3), 
            all_vars(. != "Don't know/Refuse")) %>% 
  mutate_at(vars(Q2, Q3), 
            funs(gsub("\\D+", "", .))) %>% 
  mutate_at(vars(Q2, Q3), 
            funs(as.numeric(.)))
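
In newer dplyr releases, funs() is deprecated and the _at verbs are superseded. A rough equivalent, assuming dplyr 1.0.4 or later (where across() and if_all() are available) and using a small made-up tibble called survey in place of the real download, might look like this:

```r
library(dplyr)

# Hypothetical stand-in for the Q2 and Q3 columns of the survey
survey <- tibble(
  Q2 = c("1 - Terrible", "5", "10 - Excellent", "Don't know/Refuse"),
  Q3 = c("3", "Don't know/Refuse", "8", "2")
)

cleaned <- survey %>%
  # Keep only rows where neither question is a non-response
  filter(if_all(c(Q2, Q3), ~ .x != "Don't know/Refuse")) %>%
  # Strip everything that is not a digit, then convert to numeric
  mutate(across(c(Q2, Q3), ~ as.numeric(gsub("\\D+", "", .x))))

cleaned
```

Only the first and third rows survive the filter here, because each of the other two rows has a non-response in one of the columns.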

You should consider whether removing all rows with "Don't know/Refuse" is really what you want to do - might be better to convert them e.g. to NA, depending on the downstream analysis.
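
A sketch of the NA alternative, again on a made-up tibble (the name survey and its values are assumptions for illustration) and assuming dplyr 1.0.0 or later for across():

```r
library(dplyr)

# Hypothetical stand-in for the Q2 and Q3 columns of the survey
survey <- tibble(
  Q2 = c("1 - Terrible", "5", "10 - Excellent", "Don't know/Refuse"),
  Q3 = c("3", "Don't know/Refuse", "8", "2")
)

survey_na <- survey %>%
  # Turn the non-response label into a real missing value, keeping the row
  mutate(across(c(Q2, Q3), ~ na_if(.x, "Don't know/Refuse"))) %>%
  # Strip non-digits and convert; NA stays NA
  mutate(across(c(Q2, Q3), ~ as.numeric(gsub("\\D+", "", .x))))

# Each question's mean then uses only its own valid answers
summarise(survey_na, across(c(Q2, Q3), ~ mean(.x, na.rm = TRUE)))
```

This keeps all four rows, so a respondent who skipped Q3 still counts towards Q2.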

neilfws
0

You can use readr::parse_number for this:

library(readr)
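# stringsAsFactors = FALSE keeps rate as character (the default before R 4.0 was factor)
df1 <- data.frame(rate = c("1 - terrible", "Don't know", "2", "3", "4",
                           "10 - Excellent", "Refused"),
                  stringsAsFactors = FALSE)
# Strings listed in `na` are treated as missing rather than parsed
df1$clean_rate <- parse_number(df1$rate, na = c("Don't know", "Refused"))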
df1
#             rate clean_rate
# 1   1 - terrible          1
# 2     Don't know         NA
# 3              2          2
# 4              3          3
# 5              4          4
# 6 10 - Excellent         10
# 7        Refused         NA

Then remove the NAs if you wish. One way to do it is:

df1 <- df1[!is.na(df1$clean_rate),]
moodymudskipper