0

I am trying to organize a spreadsheet of patient data with random repeat "chunks". Unfortunately, the rows of data are repeated at random, giving me repeat "chunks." I need to remove the repeat chunks while preserving the original order.

Here is a sample:

+---------+-----+----------+
| patient | age | children |
+---------+-----+----------+
| x       | 30  | g        |
| x       | 30  | b        |
| x       | 30  | g        |
| x       | 30  | b        |
| x       | 30  | g        |
| x       | 30  | b        |
| y       | 25  | g        |
| y       | 25  | b        |
| y       | 25  | b        |
| y       | 25  | g        |
| y       | 25  | b        |
| y       | 25  | b        |
+---------+-----+----------+

You can see, patient "x" chunk (with 2 children) is repeated three times, and patient "y" chunk (with 3 children) is repeated twice. The number of repeat chunks is random.

Here is my goal: It is important that the order of the children is preserved

+---------+-----+----------+
| patient | age | children |
+---------+-----+----------+
| x       | 30  | g        |
| x       | 30  | b        |
| y       | 25  | g        |
| y       | 25  | b        |
| y       | 25  | b        |
+---------+-----+----------+

I tried this first in excel: step 1: gave all rows unique identifier, to preserve the order of the children step 2: tried to remove duplicates, but this was a problem for patient "y" who has 2 girls, the final table removed one of them...

I usually do my analysis in R, so a dplyr solution would be great here if anyone could make a suggestion

Beyond the following, I'm lost. Is there a way to recognize unique groups?

dat %>% group_by(patient)
mdb_ftl
  • 423
  • 2
  • 14
  • 1
    have you tried `unique(df)` ? – moodymudskipper Apr 04 '19 at 14:13
  • Don't you have any other information ? For instance, we can imagine a z patient with 6 boys in the dataframe with repeat chunks. We can not know _a priori_ how many boys he has (1, 2, 3 or 6 boys, respectively 6, 3, 2 or no duplications). Should we minimise the number of children ? – ophdlv Apr 04 '19 at 14:21
  • I have no other information. Unfortunately, the data export creates a random repeat of the chunks. unique(df) eliminates when a patient has 2 girls – mdb_ftl Apr 04 '19 at 14:25
  • could there be a way to use rle() to do this? – mdb_ftl Apr 04 '19 at 15:10
  • @Pelilican is right: there's no way to know if a patient with three rows of `b,b,b` are 3 repeats or just one kid. You might find some help here: https://stackoverflow.com/q/35246453/5941593 – Steven Hibble Apr 04 '19 at 20:49

1 Answers1

1

The distinct() function in dplyr might be your best bet; e.g.:

dat %>% distinct()

You can find more information on identifying and removing duplicate data in R by reading this blog post.

Cody Gray - on strike
  • 239,200
  • 50
  • 490
  • 574
olorcain
  • 1,230
  • 1
  • 9
  • 12