I have a data frame with diagnoses as columns (diagnosis 1 to 30) and patient ID numbers as rows. Each observation is the diagnosis the patient received from the doctor at that step.

I started from a larger data frame, ran a TraMineR sequence analysis on it, and got the data frame described above. It looks like this:

  • d1 (diagnosis 1) etc.
  • the diagnoses shown below are just examples

             d1         d2      d3        d4     d5    d6   d7 etc.
          1 cancer
          2 cancer
          3 nothing
          4 nothing
          5 cancer 
          6 headache
    

So I want to make a new data frame where I group all patients who have "cancer" as the first diagnosis, a group with all patients who have "nothing" as the first diagnosis, and so on. This is because the data frame is too large and I want to reduce it that way.

Data example:

set.seed(1)
Data <- data.frame(
  d1 = sample(c("cancer", "cancer", "cancer", "cancer",
                "nothing", "cancer", "cancer", "cancer")),
  d2 = sample(c("cancer", "headache", "cancer", "cancer",
                "nothing", "nothing", "nothing", "nothing")),
  d3 = sample(c("cancer", "headache", "cancer", "cancer",
                "headache", "nothing", "nothing", "headache"))
)

Is that possible?

EXPECTED OUTCOME:

I expect an outcome where I can see the number of persons who had "cancer" as the first diagnosis, "nothing" as the first diagnosis, and so on. So maybe something like this:

        D1   D2    D3 D4 D5 etc.
 CANCER   5    4
 HEADACHE 4    3
 NOTHING  1    3
  • It is hard to try anything without a reprex... however, dplyr::group_by(d1) should get you started – FMM Jan 03 '19 at 08:52
  • Can you provide some example data for us to play with? https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example – william3031 Jan 03 '19 at 08:52
  • The example is in the question –  Jan 03 '19 at 09:14
  • What is the expected output? – Sotos Jan 03 '19 at 09:14
  • the expected outcome has been written above now :) –  Jan 03 '19 at 09:30
  • Your example of expected outcome does not correspond to what you explain. The table does not group patients, it just provides the distribution of the patients at each successive diagnosis (which you would get with `summary` or the `seqstatd` function of `TraMineR`). Note that in this table you no longer have sequences. – Gilbert Jan 07 '19 at 10:07

2 Answers

One way is to convert to long format, count, and then spread back to wide format. Using the tidyverse:

library(tidyverse)

Data %>% 
 gather(var, val) %>% 
 group_by_all() %>% 
 count() %>% 
 spread(var, n)

which gives,

# A tibble: 3 x 4
  val         d1    d2    d3
  <chr>    <int> <int> <int>
1 cancer       7     3     3
2 headache    NA     1     3
3 nothing      1     4     2
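For readers on tidyr 1.0.0 or later, where `gather()` and `spread()` are superseded, the same long-count-wide idea can be sketched with `pivot_longer()` and `pivot_wider()`. This is only a sketch under a few assumptions: the object name `counts` is made up here, the columns are created as character rather than factor, and missing combinations are filled with 0 instead of NA, which is a choice that differs from the output above.

```r
library(dplyr)
library(tidyr)

# Same example data as in the question; character columns are an assumption
set.seed(1)
Data <- data.frame(
  d1 = sample(c("cancer", "cancer", "cancer", "cancer",
                "nothing", "cancer", "cancer", "cancer")),
  d2 = sample(c("cancer", "headache", "cancer", "cancer",
                "nothing", "nothing", "nothing", "nothing")),
  d3 = sample(c("cancer", "headache", "cancer", "cancer",
                "headache", "nothing", "nothing", "headache")),
  stringsAsFactors = FALSE
)

counts <- Data %>%
  # long format: one row per (diagnosis column, diagnosis value) pair
  pivot_longer(everything(), names_to = "var", values_to = "val") %>%
  # number of patients per value within each diagnosis column
  count(val, var) %>%
  # back to wide format: one column per diagnosis step, 0 where a value never occurs
  pivot_wider(names_from = var, values_from = n, values_fill = list(n = 0L))

counts
```

Because `everything()` selects all columns, this should carry over to d1 through d30 without listing them by hand.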
Sotos
  • Should I replace "var", "val" and so on? The code gives me this warning: Warning message: attributes are not identical across measure variables; they will be dropped. And it gives me a data frame with only NAs –  Jan 03 '19 at 09:44
  • The warning is no problem. It is because the columns are factors. Convert to character and the warning will go away. It has nothing to do with getting only NAs. I need to see an example of the data that gives you all NA – Sotos Jan 03 '19 at 09:46
  • Can I put a screenshot of the data frame? How can I do that in here? –  Jan 03 '19 at 09:49
  • Just use `dput(head(df))` – Sotos Jan 03 '19 at 09:54
  • I have linked a picture below –  Jan 03 '19 at 09:56

This could be made more elegant, but will do the job for the reprex data and beyond:

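library(tidyverse)
# as_tibble() replaces the deprecated as.tibble(); on a table it yields columns Var1 and n
df <- as_tibble(table(Data$d1)) %>% 
  rename(D1 = n) %>%
  # all = TRUE keeps diagnoses that are missing from one of the columns (filled with NA)
  merge(as_tibble(table(Data$d2)), by = "Var1", all = TRUE) %>%
  rename(D2 = n) %>%
  merge(as_tibble(table(Data$d3)), by = "Var1", all = TRUE) %>%
  rename(D3 = n)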

Result from your reprex data:

      Var1 D1 D2 D3
1   cancer  7  3  3
2 headache NA  1  3
3  nothing  1  4  2

At some point you'd probably want to wrap this into a function given the same things are being repeated.
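One possible way to do that wrapping, sketched in base R so it loops over however many diagnosis columns there are. The function name `count_by_column` and the output column name `diagnosis` are made up for this illustration, and it reports 0 rather than NA for diagnoses that never occur in a column.

```r
# Same example data as in the question
set.seed(1)
Data <- data.frame(
  d1 = sample(c("cancer", "cancer", "cancer", "cancer",
                "nothing", "cancer", "cancer", "cancer")),
  d2 = sample(c("cancer", "headache", "cancer", "cancer",
                "nothing", "nothing", "nothing", "nothing")),
  d3 = sample(c("cancer", "headache", "cancer", "cancer",
                "headache", "nothing", "nothing", "headache"))
)

# Hypothetical helper: count each diagnosis value in every column of df
count_by_column <- function(df) {
  # all distinct diagnoses across all columns (works for factor or character columns)
  lv <- sort(unique(unlist(lapply(df, as.character))))
  out <- data.frame(diagnosis = lv, stringsAsFactors = FALSE)
  for (nm in names(df)) {
    # fixing the levels makes every column report the same rows, with 0 for absent values
    out[[nm]] <- as.vector(table(factor(as.character(df[[nm]]), levels = lv)))
  }
  out
}

res <- count_by_column(Data)
res
```

Since the loop runs over `names(df)`, the same call should work unchanged on a data frame with 30 diagnosis columns.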

nycrefugee
  • This does not generalize. The OP has `D1, D2, D3, D4, D5, ...`, so it will be impractical to do this by hand – Sotos Jan 03 '19 at 09:55