Identify groups of repeat rows and preserve group order

Question

I am trying to organize a spreadsheet of patient data with random repeat "chunks". Unfortunately, the rows of data are repeated at random, giving me repeat "chunks." I need to remove the repeat chunks while preserving the original order.

Here is a sample:

+---------+-----+----------+
| patient | age | children |
+---------+-----+----------+
| x       | 30  | g        |
| x       | 30  | b        |
| x       | 30  | g        |
| x       | 30  | b        |
| x       | 30  | g        |
| x       | 30  | b        |
| y       | 25  | g        |
| y       | 25  | b        |
| y       | 25  | b        |
| y       | 25  | g        |
| y       | 25  | b        |
| y       | 25  | b        |
+---------+-----+----------+

You can see, patient "x" chunk (with 2 children) is repeated three times, and patient "y" chunk (with 3 children) is repeated twice. The number of repeat chunks is random.

Here is my goal: It is important that the order of the children is preserved

+---------+-----+----------+
| patient | age | children |
+---------+-----+----------+
| x       | 30  | g        |
| x       | 30  | b        |
| y       | 25  | g        |
| y       | 25  | b        |
| y       | 25  | b        |
+---------+-----+----------+

I tried this first in excel: step 1: gave all rows unique identifier, to preserve the order of the children step 2: tried to remove duplicates, but this was a problem for patient "y" who has 2 girls, the final table removed one of them...

I usually do my analysis in R, so a dplyr solution would be great here if anyone could make a suggestion

Beyond the following, I'm lost. Is there a way to recognize unique groups?

dat %>% group_by(patient)

Don't you have any other information ? For instance, we can imagine a z patient with 6 boys in the dataframe with repeat chunks. We can not know _a priori_ how many boys he has (1, 2, 3 or 6 boys, respectively 6, 3, 2 or no duplications). Should we minimise the number of children ? — ophdlv, Apr 04 '19 at 14:21
I have no other information. Unfortunately, the data export creates a random repeat of the chunks. unique(df) eliminates when a patient has 2 girls — mdb_ftl, Apr 04 '19 at 14:25
@Pelilican is right: there's no way to know if a patient with three rows of `b,b,b` are 3 repeats or just one kid. You might find some help here: https://stackoverflow.com/q/35246453/5941593 — Steven Hibble, Apr 04 '19 at 20:49

score 1 · Answer 1 · edited Apr 04 '19 at 16:39

1

The distinct() function in dplyr might be your best bet; e.g.:

dat %>% distinct()

You can find more information on identifying and removing duplicate data in R by reading this blog post.

edited Apr 04 '19 at 16:39

Cody Gray - on strike

239,200
50
490
574

answered Apr 04 '19 at 14:14

olorcain

1,230
1
9
12

Identify groups of repeat rows and preserve group order

1 Answers1