I am performing a statistical analysis on data that contain meaningful 0 values as well as columns with genuinely missing values (i.e. NAs), and I would like some help with two points.

  1. I would like to perform a Principal Component Analysis on a dataset containing many 0 values that are not missing data (they are temperatures). The purpose is to cluster the data according to how temperatures vary across locations and seasons. As I understand it, the prcomp() function treats 0 values as missing values in R, so I would like to know whether anything prevents me from adding a constant (such as 1) to the whole dataset. That way, 0 values become 1, and the same constant is added to every numeric variable in my dataset. I assume this keeps the original variation in the data without preventing R from performing the PCA. As I am not confident in this method, I wanted to ask whether there is any reason not to do it.
# Create a reproducible dataset
my_df <- data.frame(
        Location = rep(LETTERS[1:6], 1000/2), 
        Zone = sample(c("Europe", " America", "Africa", "Antartic"), replace = TRUE),
        Temperatures = round(rnorm(1000), digits = 2)*10,
        RISK_MM = round(rnorm(1000), digits = 2)*100,
        Pressure = round(rnorm(1000), digits = 2)*1000,
        Sunshine = round(rnorm(1000), digits = 2))

# Add "0" values and NAs to my dataset in regards of specific categorical variables
my_df <- my_df %>% 
     mutate(
     Temperatures = case_when(
           str_detect(Zone,"Antartic") & str_detect(Location,"A") ~ 0,
           str_detect(Zone,"Antartic") & str_detect(Location,"B") ~ 0,
           str_detect(Zone,"Europe") & str_detect(Location,"A") ~ 0,
           str_detect(Zone,"Europe") & str_detect(Location,"B") ~ 0,
           str_detect(Zone,"America") & str_detect(Location,"C") ~ 0,
           str_detect(Zone,"Antartic") & str_detect(Location,"C") ~ NA_real_,
           str_detect(Zone,"Africa") & str_detect(Location,"D") ~ NA_real_,
           str_detect(Zone,"Africa") & str_detect(Location,"F") ~ NA_real_,
            TRUE ~ as.numeric(as.character(Temperatures))))

# Convert characters into factors
my_df <- mutate_if(my_df, is.character, as.factor)

# Print the results
print(my_df)

  2. Some columns also have missing values, which I was thinking of imputing before the PCA with the mice() function, grouping by my categorical independent variables of interest, like this:
# Run the multiple (m = 5) imputation
imp <- my_df %>%
  group_by(Location, Zone) %>% 
  mice(m = 5, maxit = 50, method = "cart", seed = 123)

# Create a dataset after imputation
completeImputedData <- complete(imp, 1)

# Convert the initial dataset to a numerical dataset
completeImputedData_num <- completeImputedData %>%
  ungroup() %>%
  dplyr::select(is.numeric) 

# Add a constant 
completeImputedData_num_cs <- completeImputedData_num + 1

# Dimension reduction using PCA and scale the data.
my_pca <- prcomp(completeImputedData_num_cs,  scale = TRUE, center = TRUE)

# Keep going...

Do you think these methods are suitable for my needs? Or would you recommend a different clustering method, or another way to impute the data?

Thank you for your attention; I wish you a very nice day.

Best regards,

Philippe

  • To get great answers quickly, add a [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) of your data. – Kat Aug 21 '21 at 20:53
  • Thank you for your feedback @Kat. I just added a reproducible example. – Philippe Duteil Aug 26 '21 at 14:19

1 Answer

I don't think your zero values are affecting your analysis: prcomp() treats 0 as an ordinary number, and only NA counts as missing. Let me explain. When you use PCA it is important to scale the data, which you did. Did you look at what that does to your data?
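To make the zero-versus-NA distinction concrete, here is a minimal base-R sketch. The `toy` data frame and its column names are made up for illustration and are not your dataset; the point is only that prcomp() accepts zeros and stops on NA.

```r
# Hypothetical toy data: a temperature column with genuine zeros.
set.seed(1)
toy <- data.frame(
  temp     = c(0, 0, 5, -3, 12, 0, 7, -8),
  pressure = rnorm(8, mean = 1000, sd = 10)
)

# Zeros are ordinary numbers, so prcomp() runs as usual.
pca_zero <- prcomp(toy, center = TRUE, scale. = TRUE)
print(pca_zero$sdev)

# An NA is a different story: the default prcomp() method fails on it.
toy_na <- toy
toy_na$temp[1] <- NA
res <- tryCatch(
  prcomp(toy_na, center = TRUE, scale. = TRUE),
  error = function(e) conditionMessage(e)   # keep the error message as text
)
print(res)
```

So the values that need attention before the PCA are the NAs, which is what the imputation step is for, not the zeros.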

Starting with your code - I added the libraries and set.seed().

library(tidyverse)
library(mice)

# Create a reproducible dataset
set.seed(22123)                    # I added this to make this reproducible
my_df <- data.frame(
  Location = rep(LETTERS[1:6], 1000/2), 
  Zone = sample(c("Europe", " America", "Africa", "Antartic"), replace = TRUE),
  Temperatures = round(rnorm(1000), digits = 2)*10,
  RISK_MM = round(rnorm(1000), digits = 2)*100,
  Pressure = round(rnorm(1000), digits = 2)*1000,
  Sunshine = round(rnorm(1000), digits = 2))

# Add "0" values and NAs to my dataset in regards of specific categorical variables
my_df <- my_df %>% 
  mutate(
    Temperatures = case_when(
      str_detect(Zone,"Antartic") & str_detect(Location,"A") ~ 0,
      str_detect(Zone,"Antartic") & str_detect(Location,"B") ~ 0,
      str_detect(Zone,"Europe") & str_detect(Location,"A") ~ 0,
      str_detect(Zone,"Europe") & str_detect(Location,"B") ~ 0,
      str_detect(Zone,"America") & str_detect(Location,"C") ~ 0,
      str_detect(Zone,"Antartic") & str_detect(Location,"C") ~ NA_real_,
      str_detect(Zone,"Africa") & str_detect(Location,"D") ~ NA_real_,
      str_detect(Zone,"Africa") & str_detect(Location,"F") ~ NA_real_,
      TRUE ~ as.numeric(as.character(Temperatures))))

# Convert characters into factors
my_df <- mutate_if(my_df, is.character, as.factor)
summary(my_df)
funModeling::df_status(my_df)
head(my_df)

# Run the multiple (m = 5) imputation
imp <- my_df %>%
  group_by(Location, Zone) %>% 
  mice(m = 5, maxit = 50, method = "cart", seed = 123,
       printFlag = FALSE)                             # I added this

# Create a dataset after imputation
completeImputedData <- complete(imp, 1)

# Convert the initial dataset to a numerical dataset
completeImputedData_num <- completeImputedData %>%
  ungroup() %>%
  dplyr::select(where(is.numeric))                 # I added where()

# Add a constant 
completeImputedData_num_cs <- completeImputedData_num + 1

# Dimension reduction using PCA and scale the data.
(my_pca <- prcomp(completeImputedData_num_cs,  
                  scale = TRUE, 
                  center = TRUE))         # I added the encapsulating parentheses 
                                          # (print and create object simultaneously)

# Standard deviations (1, .., p=4):
# [1] 1.0413130 1.0067763 0.9933300 0.9567467
# 
# Rotation (n x k) = (4 x 4):
#                     PC1          PC2          PC3         PC4
# Temperatures  0.6329000 -0.330262830  0.295715996  0.63475670
# RISK_MM      -0.3136613 -0.629210192  0.638899673 -0.31227925
# Pressure      0.7077274  0.003289435  0.005391224 -0.70645740
# Sunshine     -0.0132701 -0.703569596 -0.710162089 -0.02198944

But what if you scaled and centered outside the call for PCA?

df <- scale(completeImputedData_num_cs)

(my_pca <- prcomp(df,
                  scale = FALSE,   # added for clarity only
                  center = FALSE))

# Standard deviations (1, .., p=4):
# [1] 1.0413130 1.0067763 0.9933300 0.9567467
# 
# Rotation (n x k) = (4 x 4):
#                     PC1          PC2          PC3         PC4
# Temperatures  0.6329000 -0.330262830  0.295715996  0.63475670
# RISK_MM      -0.3136613 -0.629210192  0.638899673 -0.31227925
# Pressure      0.7077274  0.003289435  0.005391224 -0.70645740
# Sunshine     -0.0132701 -0.703569596 -0.710162089 -0.02198944

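Same PCA results. After scaling there are no zeros left: centering subtracts each column's mean, so the values that were 0 are no longer 0. (The zeros still reported in the first table below are simply values that were -1 before you added the constant.) Check it out: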

funModeling::df_status(completeImputedData_num_cs)

#       variable q_zeros p_zeros q_na p_na q_inf p_inf    type unique
# 1 Temperatures      12     0.4    0    0     0     0 numeric    383
# 2      RISK_MM      21     0.7    0    0     0     0 numeric    378
# 3     Pressure       0     0.0    0    0     0     0 numeric    376
# 4     Sunshine       9     0.3    0    0     0     0 numeric    364 

funModeling::df_status(df)

#           variable q_zeros p_zeros q_na p_na q_inf p_inf    type unique
# 1 var.Temperatures       0       0    0    0     0     0 numeric    383
# 2      var.RISK_MM       0       0    0    0     0     0 numeric    378
# 3     var.Pressure       0       0    0    0     0     0 numeric    376
# 4     var.Sunshine       0       0    0    0     0     0 numeric    364 
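As a side check on the add-a-constant idea, the sketch below compares a centered and scaled PCA on some data with the same PCA on that data plus 1. The data frame `x` is simulated purely for illustration (it is not the imputed dataset above), and the comparison uses absolute values of the loadings because the sign of a principal component is arbitrary.

```r
# Hypothetical simulated data with two genuine zeros in the first column.
set.seed(42)
x <- data.frame(
  temp     = c(0, 0, round(rnorm(48), 1) * 10),
  sunshine = round(rnorm(50), 2),
  pressure = round(rnorm(50), 2) * 1000
)

pca_raw     <- prcomp(x,     center = TRUE, scale. = TRUE)
pca_shifted <- prcomp(x + 1, center = TRUE, scale. = TRUE)

# Centering removes any constant shift, so both fits should agree.
same_sdev     <- isTRUE(all.equal(pca_raw$sdev, pca_shifted$sdev))
same_loadings <- isTRUE(all.equal(abs(pca_raw$rotation),
                                  abs(pca_shifted$rotation)))
print(c(same_sdev = same_sdev, same_loadings = same_loadings))

# Where an original zero ends up after scale(): at -mean / sd of its column.
scaled    <- scale(x)
zero_to   <- unname(scaled[1, "temp"])
expected  <- -mean(x$temp) / sd(x$temp)
print(c(scaled_zero = zero_to, expected = expected))
```

If both comparisons come back TRUE, adding 1 changes nothing in a centered PCA, so the step can be dropped without affecting the result.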
Kat