I'm relatively new to R, so I apologize if this question has already been answered and I was just unable to find it because I don't know what to search for.
I have a very large health-related dataset where there is a row for each doctor's visit that occurred in some specific period of time. One of the variables in the dataset is a patient ID. It's clear that there's a huge amount of duplication of patient IDs in the dataset--I have about half a million unique patient IDs and about 9 million observations. Other variables in the data are diseaseA, diseaseB, and diseaseC. Each of these is a 0 or 1--0 meaning the patient did not have the disease at the time of the visit and 1 meaning they did have the disease. One last important variable is race/ethnicity.
What I ultimately want is:
- A way to count the number of patients who ever had each disease during a doctor's visit. So, for example, if there is a 1 for diseaseA in any of the observations where patient ID == 1, that patient would be counted as ever having diseaseA.
- Among the people who ever had each disease, a way to count how many times they went to the doctor's office with that disease so that I can compare whether the average number differs by the race/ethnicity variable.
I'm quite lost as to how to do both of these. Maybe something using group_by and summarize together? I wonder if there's a solution related to the reshape package that would help with the second goal. I'm just not familiar enough with R to get much farther than that, though, so I'd love some help with this.
Here is some sample data with similar characteristics as the actual data:
structure(list(patient_id = c(2L, 1L, 1L, 1L, 4L, 1L, 2L, 5L,
5L, 1L, 2L, 1L, 2L, 4L, 5L, 2L, 4L, 2L, 1L, 3L, 3L, 1L, 2L, 1L,
4L, 5L, 2L, 2L, 1L, 1L, 1L, 1L, 2L, 3L, 3L, 3L, 1L, 3L, 3L, 4L,
4L, 1L, 2L, 5L, 5L, 2L, 2L, 2L, 1L, 2L, 2L, 4L, 1L, 3L, 2L, 5L,
4L, 3L, 3L, 1L, 2L, 1L, 2L, 5L, 5L, 4L, 4L, 4L, 4L, 4L, 2L, 4L,
4L, 3L, 4L, 5L, 4L, 3L, 4L, 5L, 1L, 5L, 4L, 1L, 3L, 1L, 1L, 3L,
5L, 2L), diseaseA = c(1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1,
0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0,
1, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 1,
0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1,
0, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1), diseaseB = c(0, 1,
0, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0,
0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0,
0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0,
1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0,
1, 0, 1, 0), diseaseC = c(0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1,
1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0,
1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0,
1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1,
0, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0), raceeth = structure(c(2L,
2L, 2L, 2L, 3L, 2L, 2L, 3L, 3L, 2L, 2L, 2L, 2L, 3L, 3L, 2L, 3L,
2L, 2L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
1L, 1L, 1L, 2L, 1L, 1L, 3L, 3L, 2L, 2L, 3L, 3L, 2L, 2L, 2L, 2L,
2L, 2L, 3L, 2L, 1L, 2L, 3L, 3L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L,
3L, 3L, 3L, 3L, 3L, 2L, 3L, 3L, 1L, 3L, 3L, 3L, 1L, 3L, 3L, 2L,
3L, 3L, 2L, 1L, 2L, 2L, 1L, 3L, 2L), .Label = c("Hispanic", "NH White",
"NH Black"), class = "factor")), row.names = c(NA, 90L), class = "data.frame")
Let me know if there's any additional information I can provide that would help.