I am currently performing a statistical analysis on data that contain meaningful 0 values as well as columns with real missing values (i.e. NAs), and I would like to ask for some help.
- I would like to perform a Principal Component Analysis (PCA) on a dataset containing many 0 values that are not missing data (they are real temperatures). The purpose is to cluster the data according to the variation in temperature across locations and seasons. As I understand it, the prcomp() function treats 0 values as missing values in R, so I would like to know whether anything prevents me from adding a constant (such as 1) to the whole dataset. This way, 0 values would become 1, and the same constant would be added to every numerical variable in my dataset. I assume this would preserve the original variation in the data without preventing R from performing the PCA. As I am not really confident in this method, I wanted to ask whether there is any reason not to do this.
# Load the packages used below
library(dplyr)
library(stringr)
# Create a reproducible dataset (1000 rows; the seed makes the random draws repeatable)
set.seed(123)
my_df <- data.frame(
  Location = rep(LETTERS[1:6], length.out = 1000),
  Zone = sample(c("Europe", "America", "Africa", "Antartic"), size = 1000, replace = TRUE),
  Temperatures = round(rnorm(1000), digits = 2) * 10,
  RISK_MM = round(rnorm(1000), digits = 2) * 100,
  Pressure = round(rnorm(1000), digits = 2) * 1000,
  Sunshine = round(rnorm(1000), digits = 2))
# Add "0" values and NAs to the dataset for specific combinations of the categorical variables
my_df <- my_df %>%
  mutate(
    Temperatures = case_when(
      str_detect(Zone, "Antartic") & str_detect(Location, "A") ~ 0,
      str_detect(Zone, "Antartic") & str_detect(Location, "B") ~ 0,
      str_detect(Zone, "Europe") & str_detect(Location, "A") ~ 0,
      str_detect(Zone, "Europe") & str_detect(Location, "B") ~ 0,
      str_detect(Zone, "America") & str_detect(Location, "C") ~ 0,
      str_detect(Zone, "Antartic") & str_detect(Location, "C") ~ NA_real_,
      str_detect(Zone, "Africa") & str_detect(Location, "D") ~ NA_real_,
      str_detect(Zone, "Africa") & str_detect(Location, "F") ~ NA_real_,
      TRUE ~ Temperatures))
# Convert characters into factors
my_df <- mutate(my_df, across(where(is.character), as.factor))
# Print the first rows
print(head(my_df))
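To make the idea of adding a constant concrete, here is a minimal sketch on a hypothetical toy data frame (`toy` is made up for illustration and is not my real dataset). It runs prcomp() on the raw values, which include genuine zeros, and on the same values shifted by 1, and then compares the two results. Loadings and scores are compared in absolute value because the sign of a principal component is arbitrary.

```r
# Hypothetical toy data (not my real dataset): the zeros are genuine measurements
set.seed(42)
toy <- data.frame(
  Temperatures = c(0, 0, round(rnorm(18), 2) * 10),
  Pressure     = round(rnorm(20), 2) * 1000,
  Sunshine     = round(rnorm(20), 2)
)

# PCA on the raw data and on the data shifted by a constant of 1
pca_raw     <- prcomp(toy,     center = TRUE, scale. = TRUE)
pca_shifted <- prcomp(toy + 1, center = TRUE, scale. = TRUE)

# Compare standard deviations, loadings and scores (up to sign)
same_sdev     <- isTRUE(all.equal(pca_raw$sdev, pca_shifted$sdev))
same_loadings <- isTRUE(all.equal(abs(pca_raw$rotation), abs(pca_shifted$rotation)))
same_scores   <- isTRUE(all.equal(abs(pca_raw$x), abs(pca_shifted$x)))
print(c(sdev = same_sdev, loadings = same_loadings, scores = same_scores))
```

If the three comparisons agree, the added constant does not change the analysis, since centering removes any constant offset from each column.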
- Some columns also have missing values, which I thought of imputing with the mice() function before performing the PCA, taking my categorical independent variables of interest into account, as follows:
# Load the imputation package
library(mice)
# Run the multiple (m = 5) imputation. As far as I can tell, mice() does not use
# dplyr groups, so Location and Zone enter the imputation model as predictors instead
imp <- mice(my_df, m = 5, maxit = 50, method = "cart", seed = 123)
# Extract the first completed dataset after imputation
completeImputedData <- mice::complete(imp, 1)
# Keep only the numerical columns
completeImputedData_num <- dplyr::select(completeImputedData, where(is.numeric))
# Add a constant
completeImputedData_num_cs <- completeImputedData_num + 1
# Dimension reduction using PCA on centered and scaled data
my_pca <- prcomp(completeImputedData_num_cs, center = TRUE, scale. = TRUE)
# Keep going...
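Since the whole point is to keep genuine zeros separate from real missing values, here is a small base-R sketch of the check I have in mind. The vector `temps` is hypothetical, and a simple mean fill is used only as a stand-in for mice(), which is not what the pipeline above does.

```r
# Hypothetical toy column: 0 is a real temperature, NA is a missing reading
temps <- c(12.5, 0, NA, -3.1, 0, NA, 7.8)

# Count genuine zeros and real missing values separately
n_zero    <- sum(temps == 0, na.rm = TRUE)
n_missing <- sum(is.na(temps))

# Stand-in for mice(): fill only the NA positions with the observed mean
temps_filled <- ifelse(is.na(temps), mean(temps, na.rm = TRUE), temps)

# The zeros should survive the imputation untouched
zeros_kept <- sum(temps_filled == 0) == n_zero
print(c(zeros = n_zero, missing = n_missing, zeros_kept = zeros_kept))
```

The same two counts could be taken on each numerical column before and after the real imputation to confirm that only the NAs were replaced.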
Do you think these methods are suitable for my needs? Or would you recommend that I investigate a different clustering method or another way to impute the data?
Thank you for your attention. I wish you a very nice day.
Best regards,
Philippe