0

I am trying to import my data using read_excel but I need to interpret any NA string value to missing values but I am stuck

Currently, my data has NAs all over the place and I need them to be blank so that when I run colsums(is.na(data)) it shouldn't show 0s

NA is chr in this table within numeric numbers as you can see in screenshot

data <- read_excel(workbook_path, na = c(""))
colSums(is.na(data))

enter image description here

enter image description here

enter image description here

Phil
  • 7,287
  • 3
  • 36
  • 66
Issam K
  • 9
  • 3
  • did you try read_excel(workbook_path, na = c("", "NA")) ? – Arthur Yip Jan 26 '23 at 00:35
  • Does this answer your question? [Using read\_excel(na = ) how do you specify more than one NA character string?](https://stackoverflow.com/questions/42635311/using-read-excelna-how-do-you-specify-more-than-one-na-character-string) – Arthur Yip Jan 26 '23 at 00:36
  • @ArthurYip I tried that code but ```{r} data <- read_excel(workbook_path, na = c("", "NA")) print(data) colSums(is.na(data)) ``` I keep seeing 0 which means I have NAs still – Issam K Jan 26 '23 at 00:37
  • I just ran sum(is.na(data)) and I get 40 which means my NAs didn't disappear – Issam K Jan 26 '23 at 00:39
  • You needing them to be "blank" and wanting them to interpreted as "missing" are 2 different things. read_excel(na = c("NA")) will force them to become a real NA, as in missing data, and this is what the function is.na is picking up. But you actually want them to be blank strings? – Arthur Yip Jan 26 '23 at 00:45
  • @ArthurYip yes I need them to not show up when I run sum(is.na(data)) they need to be gone. I tried read_excel(na = c("")) but I keep getting an error – Issam K Jan 26 '23 at 00:49
  • If you really want blank strings, you can first read_excel(na = c("NA")) and convert your missing data into blanks with df[is.na(df)] <- "" https://stackoverflow.com/questions/19592706/replace-missing-values-na-with-blank-empty-string – Arthur Yip Jan 26 '23 at 00:50
  • what do you mean not show up or gone? you want to delete the rows? cells? Are you sure "blanks" are the right data type you need? – Arthur Yip Jan 26 '23 at 00:51
  • ok so ran ```{r} data <- read_excel(workbook_path, na = c("NA")) print(data) sum(is.na(data)) ``` But I still get 40 – Issam K Jan 26 '23 at 00:55
  • 40 means you have 40 cells with missing data. Turning them blank is typically not useful. Do you understand the difference between missing and "blank" "" ? Please read all the answers and comments and links here: https://stackoverflow.com/questions/19592706/replace-missing-values-na-with-blank-empty-string https://stackoverflow.com/questions/16253789/what-is-the-difference-between-na-and-na/16253827#16253827 – Arthur Yip Jan 26 '23 at 00:57
  • So how can I avoid seeing 0s when I run colsums(is.na(data)) because my professor is asking "How many missing values are in each column? If you see zero that means you didn’t complete exercise correctly" – Issam K Jan 26 '23 at 00:58
  • If you get 40 when you run sum(is.na(data)), you shouldn't see all 0s when you run colsums on the same data where you used read_excel(... , na = "NA") – Arthur Yip Jan 26 '23 at 01:00
  • I am unfortunately and I am really annoyed with it ```{r} data <- read_excel(workbook_path, na = c("NA")) colSums(is.na(data)) sum(is.na(data)) ``` screenshot uploaded of the result – Issam K Jan 26 '23 at 01:03
  • what do you get when you do sum(colSums(is.na(data))) ? Are you sure you're seeing all the columns on your screen? – Arthur Yip Jan 26 '23 at 01:05
  • I get 40 so I am not sure why it is so difficult – Issam K Jan 26 '23 at 01:06
  • Seems like you are misunderstanding your professor's hint. Your professor just means you shouldn't get all zeros. But some columns have zero missing values, right? – Arthur Yip Jan 26 '23 at 01:07
  • I am not sure how to reaad the sum(colsums(is.na(data)) I thought he is saying we shouldn't be seeing 0s at all, I can upload the screenshot of the excel file. – Issam K Jan 26 '23 at 01:10
  • sum(colsums(is.na(data))) and sum(is.na(data)) is the same thing, 40. 40 is the total number of missing values in your data. Your professor asked you for missing values for each column. So what's wrong with the colsums(is.na(data) result, where some are zero, but some are not? – Arthur Yip Jan 26 '23 at 01:11
  • 1
    ahhhhhhh you are right, I was interpreting the question wrong !!!!!!! seeing 40 is the right way. thank you – Issam K Jan 26 '23 at 01:12
  • Welcome to SO! Its usually recommended to include a reproducible dataset with questions. Can you run `dput(data)` and paste the output into your question? Screenshots and tables are unfortunately not very helpful. – Shawn Hemelstrand Jan 26 '23 at 07:33

1 Answers1

1

Using one of readxl example files for reproducible example, you can open its location by running browseURL(dirname(readxl_example("type-me.xlsx"))), though the sheet looks like this:

enter image description here

library(readxl)
library(dplyr)
xlsx <- readxl_example("type-me.xlsx")
# open file location explorer: 
# browseURL(dirname(readxl_example("type-me.xlsx")))

# by default blank cells are treated as missing data, note the single <NA>:
df <- read_excel(xlsx, sheet = "text_coercion") %>% head(n = 2)
df
#> # A tibble: 2 × 2
#>   text    explanation  
#>   <chr>   <chr>        
#> 1 <NA>    "empty"      
#> 2 cabbage "\"cabbage\""

# add "empty" to na vector, note 2 <NA> values:
df <- readxl::read_excel(xlsx, sheet = "text_coercion", na = c("", "empty")) %>% head(n = 2)
df
#> # A tibble: 2 × 2
#>   text    explanation  
#>   <chr>   <chr>        
#> 1 <NA>     <NA>        
#> 2 cabbage "\"cabbage\""

# to replace all(!) NA values with ""
df[is.na(df)] <- ""
df
#> # A tibble: 2 × 2
#>   text      explanation  
#>   <chr>     <chr>        
#> 1 ""        ""           
#> 2 "cabbage" "\"cabbage\""

Created on 2023-01-26 with reprex v2.0.2

Note from your screenshot: you have column names in the first row of your dataframe, this breaks data type detection (everything is chr) and you should deal with that first; at that point data[is.na(data)] <- "" will no longer work as you can not write strings to numerical columns. And it's perfectly fine.

margusl
  • 7,804
  • 2
  • 16
  • 20