0

I'm relatively new to R, so I apologize if this question has already been answered and I was just unable to find it because I don't know what to search for.

I have a very large health-related dataset where there is a row for each doctor's visit that occurred in some specific period of time. One of the variables in the dataset is a patient ID. It's clear that there's a huge amount of duplication of patient IDs in the dataset--I have about half a million unique patient IDs and about 9 million observations. Other variables in the data are diseaseA, diseaseB, and diseaseC. Each of these is a 0 or 1--0 meaning the patient did not have the disease at the time of the visit and 1 meaning they did have the disease. One last important variable is race/ethnicity.

What I ultimately want is:

  1. A way to count the number of patients who ever had each disease during a doctor's visit. So, for example, if there is a 1 for diseaseA in any of the observations where patient ID == 1, that patient would be counted as ever having diseaseA.
  2. Among the people who ever had each disease, a way to count how many times they went to the doctor's office with that disease so that I can compare whether the average number differs by the race/ethnicity variable.

I'm quite lost as to how to do both of these. Maybe something using group_by and summarize together? I wonder if there's a solution related to the reshape package that would help with the second goal. I'm just not familiar enough with R to get much farther than that, though, so I'd love some help with this.

Here is some sample data with similar characteristics as the actual data:

structure(list(patient_id = c(2L, 1L, 1L, 1L, 4L, 1L, 2L, 5L, 
5L, 1L, 2L, 1L, 2L, 4L, 5L, 2L, 4L, 2L, 1L, 3L, 3L, 1L, 2L, 1L, 
4L, 5L, 2L, 2L, 1L, 1L, 1L, 1L, 2L, 3L, 3L, 3L, 1L, 3L, 3L, 4L, 
4L, 1L, 2L, 5L, 5L, 2L, 2L, 2L, 1L, 2L, 2L, 4L, 1L, 3L, 2L, 5L, 
4L, 3L, 3L, 1L, 2L, 1L, 2L, 5L, 5L, 4L, 4L, 4L, 4L, 4L, 2L, 4L, 
4L, 3L, 4L, 5L, 4L, 3L, 4L, 5L, 1L, 5L, 4L, 1L, 3L, 1L, 1L, 3L, 
5L, 2L), diseaseA = c(1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 
0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 
1, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 1, 
0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 
0, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1), diseaseB = c(0, 1, 
0, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 
0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0, 
0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 
1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 
1, 0, 1, 0), diseaseC = c(0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1, 
1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 
1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 
1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 
0, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0), raceeth = structure(c(2L, 
2L, 2L, 2L, 3L, 2L, 2L, 3L, 3L, 2L, 2L, 2L, 2L, 3L, 3L, 2L, 3L, 
2L, 2L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
1L, 1L, 1L, 2L, 1L, 1L, 3L, 3L, 2L, 2L, 3L, 3L, 2L, 2L, 2L, 2L, 
2L, 2L, 3L, 2L, 1L, 2L, 3L, 3L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L, 
3L, 3L, 3L, 3L, 3L, 2L, 3L, 3L, 1L, 3L, 3L, 3L, 1L, 3L, 3L, 2L, 
3L, 3L, 2L, 1L, 2L, 2L, 1L, 3L, 2L), .Label = c("Hispanic", "NH White", 
"NH Black"), class = "factor")), row.names = c(NA, 90L), class = "data.frame")

Let me know if there's any additional information I can provide that would help.

eliseabril
  • 65
  • 6
  • 1
    how can a patient be hispanic, NH white and NH black at the same time ? – Samet Sökel Sep 08 '21 at 21:20
  • 1
    My guess is that this comes down to summarizing by "group", where the grouping variable might be `patient_id` (think "group" as in a group of data rows, not a group of patients). If that's the case, see https://stackoverflow.com/q/11562656/3358272, that link provides multiple ways to manipulate and aggregate your data by `patient_id` and/or `raceeth`. – r2evans Sep 08 '21 at 21:27
  • @SametSökel Oh wow, that's quite a mistake on my part. I've updated the data now! Thanks for catching that. – eliseabril Sep 08 '21 at 21:36

1 Answers1

0

I calculated for each patient's diseases (and I think a patient keeps stick to their racceth, that's why grouping by patient_id and raceeth should be ok, I grouped by two variables because I need to keep race column also)

library(dplyr)

df2 <- df %>%
group_by(patient_id,raceeth) %>%
summarise_all(sum) %>%
ungroup

df2

output;

  patient_id raceeth  diseaseA diseaseB diseaseC
       <int> <fct>       <dbl>    <dbl>    <dbl>
1          1 NH White       12        8       11
2          2 NH White       11       11       10
3          3 Hispanic        6        4        7
4          4 NH Black       12        8        8
5          5 NH Black        3        8        6

Now I can calculate averages for each race group ;

df3 <- df2 %>%
select(-patient_id) %>%
group_by(raceeth) %>%
summarise_all(mean)%>%
ungroup

df3

output;

  raceeth  diseaseA diseaseB diseaseC
  <fct>       <dbl>    <dbl>    <dbl>
1 Hispanic      6        4        7  
2 NH White     11.5      9.5     10.5
3 NH Black      7.5      8        7  
Samet Sökel
  • 2,515
  • 6
  • 21
  • 1
    This worked just great! I didn't know about ungrouping, so I think that's where I was stumbling. After creating df2 here, to create a variable noting whether each patient ever had the disease, I just used an ifelse statement. Thank you! – eliseabril Sep 13 '21 at 18:50