I am trying to clean up a spreadsheet of patient data in which blocks of rows are repeated at random, giving me repeated "chunks." I need to remove the repeat chunks while preserving the original row order.
Here is a sample:
+---------+-----+----------+
| patient | age | children |
+---------+-----+----------+
| x       | 30  | g        |
| x       | 30  | b        |
| x       | 30  | g        |
| x       | 30  | b        |
| x       | 30  | g        |
| x       | 30  | b        |
| y       | 25  | g        |
| y       | 25  | b        |
| y       | 25  | b        |
| y       | 25  | g        |
| y       | 25  | b        |
| y       | 25  | b        |
+---------+-----+----------+
As you can see, patient "x"'s chunk (with 2 children) is repeated three times, and patient "y"'s chunk (with 3 children) is repeated twice. The number of repeats per chunk is random.
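For anyone who wants to run this in R, here is the same sample built with `tribble()`; I'll call it `dat`, which is the name I use in the code at the end:

    library(dplyr)

    # the sample table above, as a reproducible tibble
    dat <- tribble(
      ~patient, ~age, ~children,
      "x", 30, "g",
      "x", 30, "b",
      "x", 30, "g",
      "x", 30, "b",
      "x", 30, "g",
      "x", 30, "b",
      "y", 25, "g",
      "y", 25, "b",
      "y", 25, "b",
      "y", 25, "g",
      "y", 25, "b",
      "y", 25, "b"
    )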
Here is my goal (it is important that the order of the children within each chunk is preserved):
+---------+-----+----------+
| patient | age | children |
+---------+-----+----------+
| x       | 30  | g        |
| x       | 30  | b        |
| y       | 25  | g        |
| y       | 25  | b        |
| y       | 25  | b        |
+---------+-----+----------+
I first tried this in Excel:

1. Gave every row a unique identifier, to preserve the order of the children.
2. Tried Remove Duplicates, but this broke for patient "y", whose single chunk contains two identical "b" rows: the final table removed one of them.
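The same failure is easy to reproduce in R with `distinct()`, which does essentially what Excel's Remove Duplicates did here:

    # distinct() keeps only one of the two identical (y, 25, "b") rows,
    # which is exactly the problem: one of patient y's "b" children is lost
    dat %>% distinct(patient, age, children)

This returns four rows instead of the five I want.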
I usually do my analysis in R, so a dplyr solution would be great if anyone has a suggestion.
Beyond the following, I'm lost. Is there a way to recognize the repeating groups?

    library(dplyr)
    dat %>% group_by(patient)
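One idea I've been toying with, though I'm not sure it's idiomatic or robust: since each patient's rows seem to be an exact repetition of one chunk, the chunk length should be the smallest divisor of the group size whose prefix, when repeated, reproduces the whole `children` sequence. A sketch (this assumes `age` is constant within a patient and that the chunks repeat exactly, with no stray rows):

    library(dplyr)

    # sketch: for each patient, keep the shortest prefix of the children
    # sequence that, when repeated, reproduces the whole sequence
    # (assumes age is constant per patient and chunks repeat exactly)
    dat %>%
      group_by(patient) %>%
      filter({
        kids <- children
        n_rows <- length(kids)
        # a chunk length must divide the group size evenly
        candidates <- which(n_rows %% seq_len(n_rows) == 0)
        is_whole_repeat <- vapply(
          candidates,
          function(len) all(kids == rep_len(kids[seq_len(len)], n_rows)),
          logical(1)
        )
        chunk_len <- candidates[is_whole_repeat][1]
        # keep only the first chunk's rows
        seq_len(n_rows) <= chunk_len
      }) %>%
      ungroup()

On the sample above this gives the five-row goal table, but I don't know whether it is the right dplyr way to recognize these groups.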