0

I have a dataset in R with information about individuals and diagnoses. The variables are group, age, weight, id and diagnosis. So an individual can have one row with X in diagnosis (meaning no diagnosis) or one or more rows with diagnoses. Now I want to make a new variable with the number of diagnoses each individual got so that each individual has one row in the dataset with the variables group, age, weight, id and number of diagnoses. In this new column with diagnosis I want individuals with no diagnosis to get the number 0, with one diagnosis the number 1, with two diagnoses the number 2 and etcetera. Can anyone help me?

I am using R. I tried to use group_by and count but I can not get the number 0 for individuals with no diagnosis (X in the diagnosis column) and I can not see the other variables like group, age and weight.

Here is the data:

pr <- read_csv("~/Desktop/Data.csv")
head(pr)

Data

dput(pr)

structure(list(GROUP = c(1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 
3, 3, 3, 3, 4, 4, 4), AGE = c(23, 34, 61, 23, 45, 34, 34, 55, 
56, 43, 56, 49, 61, 49, 74, 49, 51, 46, 75), WEIGHT = c(56, 72, 
70, 56, 101, 72, 72, 62, 60, 78, 60, 55, 79, 55, 89, 55, 67, 
60, 105), ID = c(4, 1, 2, 4, 3, 1, 1, 5, 7, 6, 7, 8, 9, 8, 10, 
8, 11, 12, 13), DIAGNOSIS = c("J01", "J01", "X", "J01", "J01", 
"J01", "J01", "J01", "J01", "J01", "J01", "J01", "X", "J01", 
"J01", "J01", "X", "J01", "J01")), class = c("spec_tbl_df", "tbl_df", 
"tbl", "data.frame"), row.names = c(NA, -19L), spec = structure(list(
    cols = list(GROUP = structure(list(), class = c("collector_double", 
    "collector")), AGE = structure(list(), class = c("collector_double", 
    "collector")), WEIGHT = structure(list(), class = c("collector_double", 
    "collector")), ID = structure(list(), class = c("collector_double", 
    "collector")), DIAGNOSIS = structure(list(), class = c("collector_character", 
    "collector"))), default = structure(list(), class = c("collector_guess", 
    "collector")), skip = 1), class = "col_spec"))

Picture of the desired output: Desired output

  • 2
    Welcome to SO! Please see [How to make a great R reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). You can provide data via the output of `dput(df)`. But we only need a minimal example, so if you have a big dataset, then you can do `dput(head(df))`. Providing a minimal reproducible example and expected output helps increase chances of someone being able to help. You can just edit your original question and provide the dput results there. – AndrewGB Mar 29 '22 at 16:01
  • 1
    NA are contagious, your code looks into a row and then 'sees' nothing, adding a NA to the new column and replacing all other values in that row to a NA. Advice on how to help/fix your code will require a Reprex like @AndrewGB said. – Omniswitcher Mar 29 '22 at 16:05

1 Answers1

0

One way to approach this is to group_by multiple columns, if the information is repeated for a given individual (which it does in this example). You will get these columns in your results in the end. Also, you can summarise where the DIAGNOSIS is not "X" - instead of count, so that you will get zero for cases where DIAGNOSIS is "X".

library(dplyr)

pr %>%
  group_by(GROUP, ID, AGE, WEIGHT) %>%
  summarise(NUMBER = sum(DIAGNOSIS != "X"))

Output

   GROUP    ID   AGE WEIGHT NUMBER
   <dbl> <dbl> <dbl>  <dbl>  <int>
 1     1     1    34     72      3
 2     1     2    61     70      0
 3     1     3    45    101      1
 4     1     4    23     56      2
 5     2     5    55     62      1
 6     2     6    43     78      1
 7     2     7    56     60      2
 8     3     8    49     55      3
 9     3     9    61     79      0
10     3    10    74     89      1
11     4    11    51     67      0
12     4    12    46     60      1
13     4    13    75    105      1
Ben
  • 28,684
  • 5
  • 23
  • 45