I'm pretty new to R, so I was following a guide for cluster analysis. When I get to get_dist(), I keep getting this error: Error in colMeans(x, na.rm = TRUE) : 'x' must be numeric. When I remove the column with the <chr> data, it works fine, but I want to keep these labels in, like the "state" labels in the USArrests dataset.

I found a question that was pretty similar to mine over here, but none of its comments or answers helped me. I've seen a few posts, such as this one, that mention trying get_dist(x$x) or as.numeric(as.character(x$x)), but I must admit that this workaround doesn't make much sense to me, and I haven't had much success implementing these suggestions.

I can't show my full data set, but I can provide the results of head(), and I have noticed that it differs from head(USArrests):

library(readxl)
Mother_2_ABS_Summer_2019_clean <- read_excel("~/.../Mother_2_ABS_Summer_2019_clean.xls", 
    range = "D1:H61")
head(Mother_2_ABS_Summer_2019_clean)

...1            Audience   Genre   Structure   Proofreading
<chr>           <dbl>      <dbl>   <dbl>       <dbl>
ABS-P_29_S31    2          2       2.0         3
ABS_40_S50      3          3       3.5         3
ABS_57_S47      2          2       2.0         3
ABS_86_S48      4          3       3.0         4
ABS_143_S42     2          2       2.0         3
ABS-P_152_S49   2          1       1.0         4

head(USArrests)

             Murder   Assault   UrbanPop   Rape
             <dbl>    <int>     <int>      <dbl>
Alabama      13.2     236       58         21.2
Alaska       10.0     263       48         44.5
Arizona      8.1      294       80         31.0
Arkansas     8.8      190       50         19.5
California   9.0      276       91         40.6
Colorado     7.9      204       78         38.7

So what I've noticed is that in USArrests, the state labels aren't categorized as <chr>, unlike the identifiers for my documents.

When I follow the guide, I have no problems up until get_dist():

dat1 <- na.omit(Mother_2_ABS_Summer_2019_clean)
dat1 <- scale(dat1)

distance <- get_dist(dat1)
fviz_dist(distance, gradient = list(low = "#00AFBB", mid = "white", high = "#FC4E07"))

Error in colMeans(x, na.rm = TRUE) : 'x' must be numeric
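The error message names colMeans(), which scale() calls internally when centering, so the failure most likely happens in the scale() step rather than in get_dist() itself. A minimal sketch that reproduces it, assuming a small made-up data frame (the name toy and the column name id are hypothetical) with the same shape as the head() output above:

```r
# Hypothetical toy data: one character label column plus numeric scores.
toy <- data.frame(
  id = c("ABS_40_S50", "ABS_57_S47", "ABS_86_S48"),
  Audience = c(3, 2, 4),
  Genre = c(3, 2, 3),
  stringsAsFactors = FALSE
)

# scale() converts the data frame to a matrix first. With a character
# column present, the whole matrix becomes character, and colMeans()
# then refuses it. tryCatch() captures the message instead of stopping.
err <- tryCatch(scale(toy), error = function(e) conditionMessage(e))
print(err)

# Dropping the label column leaves an all-numeric matrix, so scale() works.
ok <- scale(toy[, c("Audience", "Genre")])
```

This only shows where the error comes from; it does not keep the labels.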

When I import only the 4 columns that contain numeric data and go through the guide, everything works just fine and I can view the cluster results. The problem is that I want to see the visualizations WITH the document identifications; otherwise the results don't mean too much when looking at them.

If any of you have any advice or suggestions, it would be greatly appreciated.

Illari
  • If I get what you are asking, you can't use ``get_dist()`` on a full data frame if it contains characters. You could try to use ``lapply()`` or a ``for loop``. – Gainz Aug 06 '19 at 12:42
  • Yeah, I figured as much when I noticed it worked without the characters column. I should probably edit the post after I get to work, but I think my question is two-fold: why does get_dist work on USArrests when the first column in it, to me, looks like it is filled with characters, and how do I replicate the set-up of USArrests with my data set to get it to work the same way? I hadn't known about lapply(), so I'll go look it up when I have the chance, thanks. – Illari Aug 06 '19 at 12:47
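
On the first half of that question: in USArrests the state names are not a column at all. They are stored as the data frame's row names, so every actual column is numeric. A quick check, using only base R and the built-in dataset (the variable names all_numeric and has_state_col are just for illustration):

```r
# USArrests ships with base R in the datasets package.
data("USArrests")

# str() lists four numeric/integer columns and no state column.
str(USArrests)

# The state names live in the row names instead.
head(rownames(USArrests))

# Every column is numeric, which is why scale() and get_dist() accept it.
all_numeric <- all(vapply(USArrests, is.numeric, logical(1)))
has_state_col <- "State" %in% names(USArrests)
```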

1 Answer

UNTESTED: You could assign those labels as the row names:

library(tidyverse)
Mother_2_ABS_Summer_2019_clean %>% remove_rownames %>% column_to_rownames(var="...1")

Maybe consider changing the first column name so the above is cleaner and more likely to work. Then it's the same format as the USArrests.
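
For anyone who prefers to avoid the tidyverse, the same idea can be sketched in base R. This is a hedged illustration, not the asker's actual workflow: the data frame toy is rebuilt from the first four rows of the head() output in the question, the column name doc_id is hypothetical, and stats::dist() stands in for get_dist() so the example runs without factoextra installed.

```r
# Toy data rebuilt from the head() output in the question;
# the label column is given the hypothetical name doc_id.
toy <- data.frame(
  doc_id = c("ABS-P_29_S31", "ABS_40_S50", "ABS_57_S47", "ABS_86_S48"),
  Audience = c(2, 3, 2, 4),
  Genre = c(2, 3, 2, 3),
  Structure = c(2.0, 3.5, 2.0, 3.0),
  Proofreading = c(3, 3, 3, 4),
  stringsAsFactors = FALSE
)

# read_excel() returns a tibble, and tibbles do not keep row names,
# so convert to a plain data frame first (a no-op for this toy data).
toy <- as.data.frame(toy)

# Move the labels into the row names, then drop the label column.
rownames(toy) <- toy$doc_id
toy$doc_id <- NULL

# All remaining columns are numeric, so scale() works,
# and the row names carry through to the distance object.
scaled <- scale(toy)
distance <- dist(scaled)
labels(distance)
```

Because the labels travel with the row names, plots built from the distance object should show the document identifiers rather than row numbers.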

NoobR
  • Thanks a lot, that did it! I added a column name, since in my spreadsheet it actually didn't have a column title, and I think "...1" was just RStudio's way of expressing that. – Illari Aug 07 '19 at 14:44