5

Given a data frame containing mixed variables (i.e. both categorical and continuous) like,

digits = 0:9
# set seed for reproducibility
set.seed(17)
# function to create random string
createRandString <- function(n = 5000) {
  a <- do.call(paste0, replicate(5, sample(LETTERS, n, TRUE), FALSE))
  paste0(a, sprintf("%04d", sample(9999, n, TRUE)), sample(LETTERS, n, TRUE))
}

df <- data.frame(ID=c(1:10), name=sample(letters[1:10]),
                 studLoc=sample(createRandString(10)),
                 finalmark=sample(c(0:100),10),
                 subj1mark=sample(c(0:100),10),subj2mark=sample(c(0:100),10)
                 )

I perform unsupervised feature selection using the package FactoMineR

df.princomp <- FactoMineR::FAMD(df, graph = FALSE)

The variable df.princomp is a list.

Thereafter, to visualize the principal components I use fviz_screeplot() and fviz_contrib() like,

#library(factoextra)
factoextra::fviz_screeplot(df.princomp, addlabels = TRUE,
                           barfill = "gray", barcolor = "black",
                           ylim = c(0, 50), xlab = "Principal Component", 
                           ylab = "Percentage of explained variance",
                           main = "Principal Component (PC) for mixed variables")

factoextra::fviz_contrib(df.princomp, choice = "var", 
                         axes = 1, top = 10, sort.val = c("desc"))

which gives the following Fig1

enter image description here

and Fig2

enter image description here

Explanation of Fig1: The Fig1 is a scree plot. A Scree Plot is a simple line segment plot that shows the fraction of total variance in the data as explained or represented by each Principal Component (PC). So we can see the first three PCs collectively are responsible for 43.8% of total variance. The question now naturally arises, "What are these variables?". This I have shown in Fig2.

Explanation of Fig2: This figure visualizes the contribution of rows/columns from the results of Principal Component Analysis (PCA). From here I can see the variables, name, studLoc and finalMark are the most important variables that can be used for further analysis.

Further Analysis- where I'm stuck at: To derive the contribution of the aforementioned variables name, studLoc, finalMark. I use the principal component variable df.princomp (see above) like df.princomp$quanti.var$contrib[,4]and df.princomp$quali.var$contrib[,2:3].

I've to manually specify the column indices [,2:3] and [,4].

What I want: I want to know how to do dynamic column index assignment, such that I do not have to manually code the column index [,2:3] in the list df.princomp?

I've already looked at the following similar questions 1, 2, 3 and 4 but cannot find my solution? Any help or suggestions to solve this problem will be helpful.

mnm
  • 1,962
  • 4
  • 19
  • 46
  • So just to be clear, what exactly is the desired result here? The column heights in figure 2? – Mikko Marttila Jul 19 '18 at 11:44
  • @MikkoMarttila thank you for your interest. It was helpful. I've updated the question so that others can understand it better (not just me :) ). I'm sure it's a trivial answer but I simply can't seem to fathom it. – mnm Jul 19 '18 at 13:36
  • so you are saying that you want some function `f(pc1,pc2)` such that if pc1 was the 2nd component and pc2 was the 3rd component then you would get `df.princomp$quanti.var$contrib[,2:3]` and `df.princomp$quali.var$contrib[,2:3]`? – august Jul 20 '18 at 17:27
  • 2
    by the way, your code is not entirely reproducible, it needs a random seed. when i run it i get slightly different pcas – august Jul 20 '18 at 17:28
  • @august thanks for the interest and for pointing out the mistake. I've updated the code. I hope its reproducible now! As already stated in the question (which I think is quite clear), I need a function that can determine the column indices in the principal components. See the principal components contain all the variables that are specified in the data. So `pc1` will contain `ID`,`name`,`studLoc`,`finalmark`,`subj1mark`,`subj1mark`. Similarly, `pc2` will also contain variables `ID`,`name`,`studLoc`,`finalmark`,`subj1mark`,`subj1mark`. The difference is in the contribution of these variables. – mnm Jul 21 '18 at 04:55
  • @august Therefore, if I code a function like `f(pc1,pc2)` such that if pc1 was the 2nd component and pc2 was the 3rd component then I will get all the variables and not `[,2:3] and [,4]`. – mnm Jul 21 '18 at 05:02
  • 2
    @Ashish Is it correct to say you believe that `df.princomp$quanti.var$contrib[,4]` is the contribution of finalMark to Dim-1? If so, I would suggest this interpretation is not correct, instead, the contribution of finalMark to Dim-1 can be found here `df.princomp$quanti.var$contrib["finalmark", "Dim.1"]` And likewise, name and studLoc's contributions to Dim-1 can be found at `df.princomp$quali.var$contrib[1:10, "Dim.1"] %>% sum()` and `df.princomp$quali.var$contrib[11:20, "Dim.1"] %>% sum()` respectively. Note these are the first three columns of figure 2 above – stevec Jul 21 '18 at 13:32
  • @user5783745 I understand the dimensions contain the contribution of the variables. Look at my response to @august above for the same. In statistics its said the first 2 PCs account for the maximum variance in the data (I can't recall the reference to quote here). Now, if I choose `PC1`, then based on this corollary, I'm saying, `df.princomp$quanti.var$contrib[,4]` is the contribution of finalMark to Dim-1. This is my understanding. – mnm Jul 24 '18 at 05:38

2 Answers2

2

Not sure if my interpretation of your question is correct, apologies if not. From what I gather you are using PCA as an initial tool to show you what variables are the most important in explaining the dataset. You then want to go back to your original data, select these variables quickly without manual coding each time, and use them for some other analysis.

If this is correct then I have saved the data from the contribution plot, filtered out the variables that have the greatest contribution, and used that result to create a new data frame with these variables alone.

digits = 0:9
# set seed for reproducibility
set.seed(17)
# function to create random string
createRandString <- function(n = 5000) {
  a <- do.call(paste0, replicate(5, sample(LETTERS, n, TRUE), FALSE))
  paste0(a, sprintf("%04d", sample(9999, n, TRUE)), sample(LETTERS, n, TRUE))
}

df <- data.frame(ID=c(1:10), name=sample(letters[1:10]),
                 studLoc=sample(createRandString(10)),
                 finalmark=sample(c(0:100),10),
                 subj1mark=sample(c(0:100),10),subj2mark=sample(c(0:100),10)
)

df.princomp <- FactoMineR::FAMD(df, graph = FALSE)

factoextra::fviz_screeplot(df.princomp, addlabels = TRUE,
                           barfill = "gray", barcolor = "black",
                           ylim = c(0, 50), xlab = "Principal Component", 
                           ylab = "Percentage of explained variance",
                           main = "Principal Component (PC) for mixed variables")

#find the top contributing variables to the overall variation in the dataset
#here I am choosing the top 10 variables (although we only have 6 in our df).
#note you can specify which axes you want to look at with axes=, you can even do axes=c(1,2)

f<-factoextra::fviz_contrib(df.princomp, choice = "var", 
                         axes = c(1), top = 10, sort.val = c("desc"))

#save data from contribution plot
dat<-f$data

#filter out ID's that are higher than, say, 20

r<-rownames(dat[dat$contrib>20,])

#extract these from your original data frame into a new data frame for further analysis

new<-df[r]

new

#finalmark name    studLoc
#1         53    b POTYQ0002N
#2         73    i LWMTW1195I
#3         95    d VTUGO1685F
#4         39    f YCGGS5755N
#5         97    c GOSWE3283C
#6         58    g APBQD6181U
#7         67    a VUJOG1460V
#8         64    h YXOGP1897F
#9         15    j NFUOB6042V
#10        81    e QYTHG0783G

Based on your comment, where you said you wanted to 'Find variables with value greater than 5 in Dim.1 AND Dim.2 and save these variables to a new data frame', I would do this:

#top contributors to both Dim 1 and 2

f<-factoextra::fviz_contrib(df.princomp, choice = "var", 
                         axes = c(1,2), top = 10, sort.val = c("desc"))

#save data from contribution plot
dat<-f$data

#filter out ID's that are higher than 5

r<-rownames(dat[dat$contrib>5,])

#extract these from your original data frame into a new data frame for further analysis

new<-df[r]

new

(This keeps all the original variables in our new data frame since they all contributed more than 5% to the total variance)

J.Con
  • 4,101
  • 4
  • 36
  • 64
  • @j-con your interpretation is absolutely correct. I'm using PCA as a unsupervised feature selection method. I had a gut feeling that it was a simple answer but just could not get my head around it. Thank you very much. Please allow me some time to `incubate` :-) on this solution and I'll get back to you. – mnm Jul 24 '18 at 06:48
  • Cool! No worries. I have pulled a lot of hair out over things like this and I find that saving the data from the plot output is a very valuable trick. – J.Con Jul 24 '18 at 06:58
  • @j-con, thank you for the advice `saving the data from the plot output is a very valuable trick`. This is was the missing piece. Your solution works as desired, Also many thanks to all the other user's who greatly helped in refining the question and the subsequent answer. Without your help, it would not be possible. Cheers. – mnm Jul 24 '18 at 07:34
1

There are a lot of ways to extract contributions of individual variables to PCs. For numeric input, one can run a PCA with prcomp and look at $rotation (I spoke to soon and forgot you've got factors here so prcomp won't work directly). Since you are using factoextra::fviz_contrib, it makes sense to check how that function extracts this information under the hood. Key factoextra::fviz_contrib and read the function:

> factoextra::fviz_contrib
function (X, choice = c("row", "col", "var", "ind", "quanti.var", 
    "quali.var", "group", "partial.axes"), axes = 1, fill = "steelblue", 
    color = "steelblue", sort.val = c("desc", "asc", "none"), 
    top = Inf, xtickslab.rt = 45, ggtheme = theme_minimal(), 
    ...) 
{
    sort.val <- match.arg(sort.val)
    choice = match.arg(choice)
    title <- .build_title(choice[1], "Contribution", axes)
    dd <- facto_summarize(X, element = choice, result = "contrib", 
        axes = axes)
    contrib <- dd$contrib
    names(contrib) <- rownames(dd)
    theo_contrib <- 100/length(contrib)
    if (length(axes) > 1) {
        eig <- get_eigenvalue(X)[axes, 1]
        theo_contrib <- sum(theo_contrib * eig)/sum(eig)
    }
    df <- data.frame(name = factor(names(contrib), levels = names(contrib)), 
        contrib = contrib)
    if (choice == "quanti.var") {
        df$Groups <- .get_quanti_var_groups(X)
        if (missing(fill)) 
            fill <- "Groups"
        if (missing(color)) 
            color <- "Groups"
    }
    p <- ggpubr::ggbarplot(df, x = "name", y = "contrib", fill = fill, 
        color = color, sort.val = sort.val, top = top, main = title, 
        xlab = FALSE, ylab = "Contributions (%)", xtickslab.rt = xtickslab.rt, 
        ggtheme = ggtheme, sort.by.groups = FALSE, ...) + geom_hline(yintercept = theo_contrib, 
        linetype = 2, color = "red")
    p
}
<environment: namespace:factoextra>

So it's really just calling facto_summarize from the same package. By analogy you can do the same thing, simply call:

> dd <- factoextra::facto_summarize(df.princomp, element = "var", result = "contrib", axes = 1)
> dd
               name    contrib
ID               ID  0.9924561
finalmark finalmark 21.4149175
subj1mark subj1mark  7.1874438
subj2mark subj2mark 16.6831560
name           name 26.8610132
studLoc     studLoc 26.8610132

And that's the table corresponding to your figure 2. For PC2 use axes = 2 and so on.

Regarding "how to programmatically determine the column indices of the PCs", I'm not 100% sure I understand what you want, but if you just want to say for column "finalmark", grab its contribution to PC3 you can do the following:

library(tidyverse)
# make a tidy table of all column names in the original df with their contributions to all PCs
contribution_df <- map_df(set_names(1:5), ~factoextra::facto_summarize(df.princomp, element = "var", result = "contrib", axes = .x), .id = "PC")

# get the contribution of column 'finalmark' by name
contribution_df %>%
  filter(name == "finalmark")

# get the contribution of column 'finalmark' to PC3
contribution_df %>%
  filter(name == "finalmark" & PC == 3)

# or, just the numeric value of contribution
filter(contribution_df, name == "finalmark" & PC == 3)$contrib

BTW I think ID in your example is treated as numeric instead of factor, but since it's just an example I'm not bothering with it.

Yue Jiang
  • 1,271
  • 10
  • 15
  • thank you for the pointer to look at the function structure. It was helpful. Please, do edit your answer (if possible) to include what are those " lot of ways to extract contributions of individual variables to PCs". It will be very helpful to me and others who are interested. Continuing further, can you elaborate on "how to programmatically determine the column indices of the PCs" contained in the variable `dd` and map it to the original dataframe? Thanks. – mnm Jul 21 '18 at 17:28
  • updated the answer. does that help? i guess i still don't 100% understand what you mean by "how to programmatically determine the column indices of the PCs", but given a table of contributions of all column names to all the PCs, i guess you can extract whatever you want? – Yue Jiang Jul 22 '18 at 01:01
  • thank you for updating the answer. What i mean by "programmatically determine the column indices" is, after knowing the PCs which are essentially the variables. Is it possible to obtain the column indices of these variables?" so that I do not have to hard-code them? I hope it makes sense? Please any ideas or suggestions are welcome. – mnm Jul 22 '18 at 11:20
  • hmm i'm confused by "knowing the PCs which are essentially the variables" - PCs are not variables. also not sure what column indices are referring to. Is what you want a function? if so, it would be helpful to have an example input (which you kind of have already, but would be nice to make specific) and a well defined example output (is it a data frame? a value? a list?) – Yue Jiang Jul 22 '18 at 15:58
  • 1
    I've run out of examples to explain it further. I've been as verbose as could be possible (includes the question & in comments). Thanks for all your inputs so far, – mnm Jul 23 '18 at 00:11
  • 1
    I'd read somewhere that the first two or maybe three PCs contain the maximum variance in the data (can't recall the source for now). So based on this corollary in the above code, `df.princomp$quanti.var$contrib[,1:5]` contains the contribution of the variables in 5 dimensions namely `Dim.1`, `Dim.2` and so on. Is it possible to write code for a question like, "Find variables with value greater than 5 in Dim.1 AND Dim.2 and save these variables to a new data frame". Once these variables are determined, I can then map them to the original dataframe. PCA is akin to feature selection. – mnm Jul 24 '18 at 05:52