
I have a dataset like this that I have turned into a massive dendrogram using time series clustering:

DF<-structure(list(`Smith, Sumner` = c(" 0", " 0", " 0", " 0", " 0", 
                                    " 0", " 0", " 0", " 0", " 0", " 0", " 0", " 0", " 0", " 0", " 0", 
                                    "  0", "  0", "  0", "  0", "  0", "  0", "  0", "  0", "  0", 
                                    "  0", "  0", "  1", "  1", "  1", "  1", "  2", "  3", "  7", 
                                    " 15", " 22", " 25", " 31", " 32", " 40", " 41", " 45", " 47", 
                                    " 48", " 48", " 49", " 49", " 49", " 49", " 49", " 49"), `Fizzle III, Joseph` = c(" 0", " 0", " 0", " 0", " 0", " 0", " 0", " 0", " 0", " 0", " 0", " 0", 
                                                                                                                     " 0", " 0", " 0", " 0", "  0", "  0", "  0", "  0", "  0", "  0", 
                                                                                                                     "  0", "  0", "  0", "  0", "  0", "  0", "  0", "  0", "  0", 
                                                                                                                     "  5", "  6", "  7", "  9", "  9", " 11", " 21", " 25", " 33", 
                                                                                                                     " 38", " 44", " 51", " 54", " 57", " 60", " 61", " 67", " 72", 
                                                                                                                     " 73", " 73"), `johnson, Barry` = c(" 0", " 0", " 0", " 0", " 0", 
                                                                                                                                                      " 0", " 0", " 0", " 0", " 0", " 0", " 0", " 0", " 0", " 0", " 0", 
                                                                                                                                                      "  0", "  0", "  0", "  0", "  0", "  0", "  0", "  0", "  0", 
                                                                                                                                                      "  0", "  0", "  0", "  0", "  0", "  0", "  1", "  5", "  7", 
                                                                                                                                                      " 11", " 12", " 17", " 20", " 21", " 24", " 25", " 28", " 28", 
                                                                                                                                                      " 28", " 28", " 28", " 31", " 31", " 33", " 33", " 33"), `peanut, Mark` = c(" 0", 
                                                                                                                                                                                                                                   " 0", " 0", " 0", " 0", " 0", " 0", " 1", " 2", " 5", "10", "18", 
                                                                                                                                                                                                                                   "22", "23", "27", "28", " 30", " 34", " 42", " 44", " 48", " 51", 
                                                                                                                                                                                                                                   " 62", " 64", " 65", " 66", " 67", " 68", " 73", " 75", " 76", 
                                                                                                                                                                                                                                   " 81", " 86", " 89", " 89", " 92", " 94", "102", "111", "118", 
                                                                                                                                                                                                                                   "133", "141", "146", "157", "158", "158", "158", "158", "158", 
                                                                                                                                                                                                                                   "158", "158"), `alpha, John A` = c(" 0", " 0", " 0", " 0", 
                                                                                                                                                                                                                                                                        " 0", " 0", " 0", " 0", " 0", " 0", " 0", " 0", " 0", " 0", " 0", 
                                                                                                                                                                                                                                                                        " 0", "  0", "  0", "  0", "  0", "  0", "  0", "  0", "  0", 
                                                                                                                                                                                                                                                                        "  0", "  0", "  6", " 11", " 13", " 15", " 17", " 20", " 31", 
                                                                                                                                                                                                                                                                        " 35", " 41", " 46", " 53", " 59", " 69", " 87", " 91", " 93", 
                                                                                                                                                                                                                                                                        "103", "127", "133", "133", "133", "133", "133", "133", "133"
                                                                                                                                                                                                                                   ), `barry, Lloyd Alan` = c(" 0", " 0", " 0", " 1", " 2", " 2", 
                                                                                                                                                                                                                                                                " 3", " 3", " 3", " 3", " 3", " 5", " 7", "11", "13", "18", " 23", 
                                                                                                                                                                                                                                                                " 23", " 23", " 27", " 28", " 31", " 32", " 32", " 33", " 33", 
                                                                                                                                                                                                                                                                " 33", " 33", " 33", " 33", " 33", " 33", " 33", " 33", " 33", 
                                                                                                                                                                                                                                                                " 33", " 33", " 33", " 33", " 33", " 33", " 33", " 33", " 33", 
                                                                                                                                                                                                                                                                " 33", " 33", " 33", " 33", " 33", " 33", " 33"), `smith, EK` = c(" 0", 
                                                                                                                                                                                                                                                                                                                                    " 0", " 2", " 3", " 3", " 3", " 4", " 6", " 6", " 6", " 6", " 6", 
                                                                                                                                                                                                                                                                                                                                    " 6", " 7", "14", "15", " 18", " 25", " 28", " 29", " 33", " 37", 
                                                                                                                                                                                                                                                                                                                                    " 45", " 49", " 51", " 54", " 61", " 65", " 65", " 70", " 75", 
                                                                                                                                                                                                                                                                                                                                    " 79", " 79", " 81", " 82", " 83", " 87", " 89", " 89", " 91", 
                                                                                                                                                                                                                                                                                                                                    " 91", " 91", " 91", " 93", " 95", " 95", " 98", " 98", " 99", 
                                                                                                                                                                                                                                                                                                                                    "100", "100"), `parvin, Eric David` = c(" 0", " 0", " 0", " 0", 
                                                                                                                                                                                                                                                                                                                                                                            " 0", " 0", " 0", " 0", " 0", " 0", " 0", " 0", " 0", " 0", " 0", 
                                                                                                                                                                                                                                                                                                                                                                            " 0", "  0", "  4", "  6", "  6", "  6", "  6", "  6", "  6", 
                                                                                                                                                                                                                                                                                                                                                                            "  6", "  6", "  6", "  6", "  6", "  6", "  7", "  7", "  9", 
                                                                                                                                                                                                                                                                                                                                                                            " 10", " 10", " 10", " 10", " 10", " 10", " 10", " 10", " 10", 
                                                                                                                                                                                                                                                                                                                                                                            " 10", " 10", " 10", " 10", " 10", " 10", " 10", " 10", " 10"
                                                                                                                                                                                                                                                                                                                                    ), `Burgess, Gary` = c(" 0", " 0", " 0", " 1", " 1", " 1", 
                                                                                                                                                                                                                                                                                                                                                                 " 1", " 1", " 1", " 1", " 1", " 1", " 1", " 1", " 1", " 3", "  5", 
                                                                                                                                                                                                                                                                                                                                                                 "  5", "  5", "  6", "  7", "  7", "  8", "  8", "  8", "  9", 
                                                                                                                                                                                                                                                                                                                                                                 "  9", "  9", "  9", " 11", " 11", " 11", " 11", " 12", " 12", 
                                                                                                                                                                                                                                                                                                                                                                 " 14", " 14", " 15", " 15", " 17", " 17", " 17", " 18", " 18", 
                                                                                                                                                                                                                                                                                                                                                                 " 18", " 18", " 18", " 18", " 18", " 18", " 18"), `smith, john` = c(" 0", 
                                                                                                                                                                                                                                                                                                                                                                                                                                            " 0", " 0", " 0", " 1", " 1", " 3", " 6", " 6", " 6", " 8", " 8", 
                                                                                                                                                                                                                                                                                                                                                                                                                                            " 8", " 8", " 8", " 8", "  8", "  8", "  8", "  9", " 10", " 11", 
                                                                                                                                                                                                                                                                                                                                                                                                                                            " 13", " 14", " 16", " 16", " 17", " 18", " 18", " 19", " 20", 
                                                                                                                                                                                                                                                                                                                                                                                                                                            " 20", " 20", " 21", " 21", " 22", " 22", " 22", " 22", " 22", 
                                                                                                                                                                                                                                                                                                                                                                                                                                            " 22", " 22", " 22", " 22", " 22", " 22", " 22", " 22", " 22", 
                                                                                                                                                                                                                                                                                                                                                                                                                                            " 22", " 22")), row.names = c(NA, -51L), class = c("tbl_df", 
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               "tbl", "data.frame"))

P.S. Does anyone know why it pastes so strangely, with giant spaces, when I copy from RStudio?

Anyhow, in the data, each column is a person (names should be scrambled) and the rows represent years, where a certain number of events happened each year. I've used time series clustering with the real data set (hundreds of columns) to create a dendrogram that groups the most similar columns together. I can access that grouping in a data frame that looks like this:

DF2<-structure(list(type_col = c("Smith, Sumner", "josephs, Joseph", 
"smith, Barry", "johnson, Mark", "Peanut, John A", "smithy, Lloyd Alan", 
"john, EK", "Amistad, Eric David", "Hotdog, Gary ", "Jones, SMith"
), cluster_group = c(1L, 2L, 2L, 1L, 3L, 3L, 1L, 1L, 2L, 1L)), row.names = c(NA, 
10L), class = "data.frame")

So this shows me the names (I apologize these aren't the exact same names shown in the other example data) and their respective groups.

What I would love to do is plot something like this (ignore the "90's" and "80's"; where it says A or B, I'd like that to be group 1 or 2, respectively):

[image: example plot]

I would take each group and then "average" its members' data to create one line per group over time. Does that make sense? I know that ggplot can use a "grouping" variable, and I also know that multiple geom_lines can be on a single graph, but beyond that I am totally lost. Help!

Joe Crozier
  • Do you want a plot only for the names in `DF2`, leaving out all values between 80 and 90? – Duck Aug 14 '20 at 14:05
  • I'm sorry, the 80 and 90 are not relevant to my question (they just happened to be in the photo I used as an example; I should have used a better one). What I want specifically is this: let's say that I know (based on DF2) that barry smith and joseph josephs belong to the same group (let's say group 2). Based on the information in DF, it would take the two columns for those guys and average their values in each row. This new average for each row is what would be graphed for group 2. It would do this for every group, so the graph would have one line per group. Does that make sense? – Joe Crozier Aug 14 '20 at 14:08
  • And are the names in `DF` with no fuzzy match in `DF2` one group only, one group each, or something else? – Rui Barradas Aug 14 '20 at 14:19
  • I should have taken my time with the names. In reality the names in DF and DF2 always match, because DF2 was generated FROM DF. The reason they may not look like they match here is that, in my haste to post, I changed the names by hand to anonymize them and didn't do a good job. I'm sorry – Joe Crozier Aug 14 '20 at 14:21

1 Answer


This is mostly a data reshaping problem. First convert DF from wide to long format, then merge with DF2, summarise by groups of time and cluster. Finally, plot the result.

In order to have matching names in DF and DF2, I have changed the posted data.

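library(tidyverse)

# The posted values are character strings such as " 15"; convert every column to numeric.
DF[] <- lapply(DF, function(x) as.numeric(as.character(x)))
# Replace the mismatched example names with letters so DF and DF2 line up.
names(DF) <- LETTERS[seq_len(ncol(DF))]

DF2$type_col <- LETTERS[seq_len(ncol(DF))]

DF %>%
  # Use the row number as the time index.
  rownames_to_column(var = "time") %>%
  mutate(time = as.integer(time)) %>%
  # Wide to long: one row per (time, person) pair.
  pivot_longer(
    cols = -time,
    names_to = "type_col",
    values_to = "Value"
  ) %>%
  # Attach each person's cluster, then average within each time and cluster.
  left_join(DF2, by = "type_col") %>%
  mutate(cluster_group = factor(cluster_group)) %>%
  group_by(time, cluster_group) %>%
  summarise(Mean = mean(Value, na.rm = TRUE), .groups = "drop_last") %>%
  ggplot(aes(time, Mean, color = cluster_group)) +
  geom_line()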

[image: resulting plot of Mean against time, one line per cluster_group]
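
As a cross-check on the idea of averaging the columns in each cluster, here is a minimal base R sketch. The toy objects `wide` and `clusters` are hypothetical stand-ins for `DF` and `DF2`, not the posted data. It first confirms that the names in both tables match, then computes the row-wise mean of the columns in each cluster.

```r
# Hypothetical toy data: three yearly series and their cluster assignments.
wide <- data.frame(a = c(0, 2, 4), b = c(2, 4, 6), c = c(10, 10, 10))
clusters <- data.frame(type_col = c("a", "b", "c"),
                       cluster_group = c(1L, 1L, 2L))

# Names present in one table but not the other; both should be empty
# before joining, otherwise some series end up with no cluster.
missing_in_clusters <- setdiff(names(wide), clusters$type_col)
missing_in_wide <- setdiff(clusters$type_col, names(wide))

# Row-wise mean of the columns belonging to each cluster.
# The result is a matrix with one row per time step and one column per cluster.
group_means <- sapply(split(clusters$type_col, clusters$cluster_group),
                      function(cols) rowMeans(wide[, cols, drop = FALSE]))
group_means
```

If either `setdiff()` result is non-empty, the `left_join()` in the pipeline would leave `NA` cluster groups, so it may be worth checking this before plotting.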

Rui Barradas
  • It worked great with my real data! What's weird, though, is that it's not working with a different data set I have that's nearly identical. Rather than paste everything in this thread, is there a way I can direct message you the link to where I have this on RStudio Cloud and have you take a look at it? I get a weird error that says "Error in .subset2(chunks, self$get_current_group()) : attempt to select less than one element in integerOneIndex" – Joe Crozier Aug 14 '20 at 15:34
  • @JoeCrozier You can paste a link to the data if you want to. In the meantime, run `rlang::last_error()` to see where the error occurred. Maybe remove the `.groups` argument from `summarise`. – Rui Barradas Aug 14 '20 at 15:56
  • I'm hesitant to post the link for everyone to see, but I can provide some detail; maybe it'll be helpful for someone else if they ever run into this. If I run `DF %>% rownames_to_column(var = "time")`, everything works fine, but the second I add the next line, `DF %>% rownames_to_column(var = "time") %>% mutate(time = as.integer(time))`, I get the error. When I look at the time variable (if I assign the result to a data frame), it doesn't LOOK any different from the one I get with the data that works. The time column runs from 1 to 41 and looks normal? – Joe Crozier Aug 14 '20 at 16:17
  • @JoeCrozier OK, is it a factor? Maybe you don't need to post the data; it is often private. Pipe `rownames_to_column` to `str` to show the data structure and see what's in there. I mean, `rownames_to_column(var = "time") %>% str()` will end the pipe. – Rui Barradas Aug 14 '20 at 16:25
  • @Rui Barradas It looks like they're all num, except for the time variable, which is character. This is the same as in the data set that works fine. Really, side by side with the data set that works, it looks almost identical. I know both data sets are nothing but numbers, and there are no missing data (both work when creating dendrograms). The only difference, so far as I can tell, is that the one that works has 51 rows and 264 columns, while the one that doesn't has 41 rows and 257 columns – Joe Crozier Aug 14 '20 at 16:30
  • @JoeCrozier Maybe `mutate(time = row_number())` instead of `rownames_to_column`. – Rui Barradas Aug 14 '20 at 16:37
  • Found it! Thank you! When I ran `DF %>% rownames_to_column(var = "time") %>% str()`, it said column name "time" must not be duplicated, which I thought was weird because we had just created it. I started wondering if I somehow already had a "time" variable. I didn't, but I did find a duplicate column name. When I finally removed the duplicate, everything worked great! Thank you for all your help with this – Joe Crozier Aug 14 '20 at 17:08
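
For anyone who hits the same error, the duplicate column name found in the last comment can be located with a couple of lines of base R. This is only a sketch: `wide` is a hypothetical data frame built to contain a duplicate, and keeping the first occurrence of each name is one possible fix, not necessarily what was done in the thread.

```r
# Hypothetical wide data with a duplicated column name ("B" appears twice).
# check.names = FALSE stops data.frame() from renaming the duplicate.
wide <- data.frame(A = 1:3, B = 4:6, B = 7:9, check.names = FALSE)

# Names that occur more than once.
dup_names <- unique(names(wide)[duplicated(names(wide))])
dup_names

# One possible fix: keep only the first occurrence of each name.
deduped <- wide[, !duplicated(names(wide)), drop = FALSE]
names(deduped)
```

Whether to drop, rename, or merge the duplicated columns depends on the data; `make.unique(names(wide))` is an alternative if both columns should be kept.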