I was experimenting with the Titanic dataset to practice some basics I had learned recently and ran into the error below. Here is the scenario:

titanic <- fread("titanic3.csv")

Next, I checked for empty strings in a particular column:

titanic[embarked==""]

This returns 3 rows that have an empty string in this column.

Next, I found missing values (NA) in age, so I replaced each missing age with the mean age for that passenger's sex:

titanic <- titanic %>% group_by(sex) %>% mutate(age=if_else(is.na(age), mean(age, na.rm = TRUE), age))

After this, I noticed in View(titanic) that the 'boat' column of the data frame also contains empty strings.

So, just as with the 'embarked' column, I tried to find the empty strings in the 'boat' column with the following query, so that I could replace them with NA, but I get this error:

titanic[boat=='']
Error in `[.data.frame`(titanic, boat == "") : object 'boat' not found

I get this error only after updating the 'age' column in the titanic data frame with the mean age values. If I run the same code before that step, I do not get the error.

I don't understand why I am getting this error or what mistake I am making.

cyborg

1 Answer

Try referencing the column explicitly and wrapping the condition in which():

library(data.table)  # provides fread()
library(tidyverse)
titanic <- fread("titanic3.csv")

titanic <- titanic %>% group_by(sex) %>% mutate(age=if_else(is.na(age), mean(age, na.rm = TRUE), age))

titanic[which(titanic$boat == ''),]

It outputs:

# A tibble: 823 x 14
# Groups:   sex [2]
   pclass survived                                            name    sex      age sibsp parch   ticket     fare   cabin embarked  boat  body
    <int>    <int>                                           <chr>  <chr>    <dbl> <int> <int>    <chr>    <dbl>   <chr>    <chr> <chr> <chr>
 1      1        0                    Allison, Miss. Helen Loraine female  2.00000     1     2   113781 151.5500 C22 C26        S            
 2      1        0            Allison, Mr. Hudson Joshua Creighton   male 30.00000     1     2   113781 151.5500 C22 C26        S         135
 3      1        0 Allison, Mrs. Hudson J C (Bessie Waldo Daniels) female 25.00000     1     2   113781 151.5500 C22 C26        S            
 4      1        0                          Andrews, Mr. Thomas Jr   male 39.00000     0     0   112050   0.0000     A36        S            
 5      1        0                         Artagaveytia, Mr. Ramon   male 71.00000     0     0 PC 17609  49.5042                C          22
 6      1        0                          Astor, Col. John Jacob   male 47.00000     1     0 PC 17757 227.5250 C62 C64        C         124
 7      1        0                             Baumann, Mr. John D   male 30.58523     0     0 PC 17318  25.9250                S            
 8      1        0                        Baxter, Mr. Quigg Edmond   male 24.00000     0     1 PC 17558 247.5208 B58 B60        C            
 9      1        0                             Birnbaum, Mr. Jakob   male 25.00000     0     0    13905  26.0000                C         148
10      1        0                    Blackwell, Mr. Stephen Weart   male 45.00000     0     0   113784  35.5000       T        S            
# ... with 813 more rows, and 1 more variables: home.dest <chr>

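The original statement no longer works because group_by() returns a grouped_df (a grouped tibble), so titanic is no longer a data.table. titanic[boat == ''] is then handled by `[.data.frame`, which does not look up boat among the columns, hence the "object 'boat' not found" error.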
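
The class change can be checked directly. The following is a minimal sketch on a small made-up table (the `toy` object and its values are assumptions for illustration, not the real titanic3.csv), assuming the data.table and dplyr packages are installed:

```r
library(data.table)
library(dplyr)

# Toy stand-in for the titanic data (hypothetical values).
toy <- data.table(
  sex  = c("female", "male", "male", "female"),
  age  = c(2, NA, 30, NA),
  boat = c("", "11", "", "3")
)

# toy is a data.table, so boat is looked up among its columns.
class(toy)
dt_rows <- toy[boat == ""]

# Same imputation step as in the question.
grouped <- toy %>%
  group_by(sex) %>%
  mutate(age = if_else(is.na(age), mean(age, na.rm = TRUE), age))

# The result is a grouped tibble and no longer carries the data.table class.
class(grouped)

# With a data frame or tibble, the column has to be referenced explicitly.
df_rows <- grouped[grouped$boat == "", ]

# Converting back restores the data.table syntax.
back <- as.data.table(ungroup(grouped))
back_rows <- back[boat == ""]
```

If the data.table syntax is needed after a dplyr pipeline, converting back with as.data.table() as in the last step is one option; another is to do the imputation in data.table itself so the class never changes.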

psteinroe
  • Yes, this worked. Could you explain why the query worked without which before grouping, but threw an error afterwards? – cyborg Jan 17 '18 at 07:11
  • Sure. The mutation changed the data type to a grouped data frame. With data frames, the column has to be referenced explicitly as `titanic$boat`, and wrapping the condition in which works on both a data frame and a data.table. I cannot say more about why that is the case, as I am not that experienced with data.tables. Hope it helps :) – psteinroe Jan 17 '18 at 07:14
  • Ah, that's a good explanation. If I may ask, could you also clarify when I need to use which? Is it once I have grouped the data, or whenever I perform an operation on a data frame? I am just trying to understand as much as possible. – cyborg Jan 17 '18 at 07:18
  • Following what you said, I loaded the data as a data.frame with `titanic <- fread("titanic3.csv",data.table = FALSE)`. Now `titanic[embarked==""]` throws an error, whereas the version with which runs. I am not sure how to understand the difference between data.frame and data.table, or when and where to use each. – cyborg Jan 17 '18 at 07:44
  • Rather than tell you something wrong, I'll point you to [this post](https://stackoverflow.com/questions/13618488/what-you-can-do-with-data-frame-that-you-cant-in-data-table), which explains subsetting in R fairly well. – psteinroe Jan 17 '18 at 07:45
  • Regarding your question about data.frame vs data.table, I did a quick search and [this question](https://stackoverflow.com/questions/18001120/what-is-the-practical-difference-between-data-frame-and-data-table-in-r) explains the differences. (I learned something new from it too, thank you.) – psteinroe Jan 17 '18 at 07:49
  • Thank you so much, you have been very patient in explaining. I really appreciate your response. – cyborg Jan 17 '18 at 07:50
  • No worries. I would be happy if you accepted my answer. Also, check [this question](https://stackoverflow.com/questions/13618488/what-you-can-do-with-data-frame-that-you-cant-in-data-table), as it explains the differences better. – psteinroe Jan 17 '18 at 07:51
  • Using the code you suggested, `titanic[which(titanic$boat == ''),]`, I was able to see the rows where **boat** is an empty string. Now I want to replace these empty strings with NA, but when I run `titanic[which(titanic$boat == ''),]<-NA`, every value in the matching rows is replaced by NA. Why is that, and how can I replace only the boat column values? – cyborg Jan 18 '18 at 04:18
  • Hi, sorry for the late reply. If you want to read all blank values as NA, use `titanic <- fread("titanic3.csv",na.strings=c(""))`. If you want to do it afterwards, try the following: `titanic$boat[titanic$boat == ""] <- NA` – psteinroe Jan 18 '18 at 19:53
  • Oh, interesting. I was racking my brain over the which function. – cyborg Jan 19 '18 at 00:07
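
The difference between the two assignments in the last comments can be seen in a minimal base R sketch. The `df` data frame and its values are made up for illustration and only stand in for the real titanic data:

```r
# Toy data frame standing in for titanic (hypothetical values).
df <- data.frame(
  name = c("a", "b", "c"),
  boat = c("", "11", ""),
  stringsAsFactors = FALSE
)

# An empty column index selects every column, so the whole
# matching rows are overwritten with NA.
whole_rows <- df
whole_rows[which(whole_rows$boat == ""), ] <- NA

# Selecting the column first limits the assignment to boat.
only_boat <- df
only_boat$boat[only_boat$boat == ""] <- NA
```

In `whole_rows` the name values of the matching rows are lost as well, while `only_boat` keeps every other column untouched.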