I am trying to remove some rows in a for loop in R. The condition compares each row to the row below it, so I can't filter within the brackets.

I know that I can remove a row when a constant is specified: dataframe[-2, ]. I just want to do the same with a variable: dataframe[-x, ]. Here's the full loop:

for (j in 1:(nrow(referrals) - 1)) {
  k <- j + 1
  if (referrals[j, "Client ID"] == referrals[k, "Client ID"] & 
      referrals[j, "Provider SubCode"] == referrals[k, "Provider SubCode"]) {
    referrals[-k, ]
  }
}

The code runs without complaint, but no rows are removed (and I know some should be). Of course, if I test it with a constant, it works fine: referrals[-2, ].

pez
    Just running `referrals[-k,]` doesn't actually do anything. As with anything in R, if you want to change an object, you need to _assign_ to it, i.e. `referrals <- referrals[-k,]`. – joran Oct 21 '16 at 15:03
    ...although, I should point out that it's not clear to me that this code will behave the way you expect even with that piece fixed. – joran Oct 21 '16 at 15:11
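    When you remove row `k` in one iteration, the rows below it shift up, so in the next iteration `j` already points at the row that used to sit below `k`, and one comparison is skipped. The data frame also ends up with fewer rows than the range fixed at the start of the `for` loop, so the loop eventually indexes past the last row; that lookup returns `NA`, and `if` stops with an error on an `NA` condition. So, as @joran said, you should consider reformulating your code. – Fábio Oct 21 '16 at 16:09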
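
To make the two comments above concrete, here is a minimal sketch of a loop that assigns the result back and only advances when nothing was removed. The toy `referrals` data frame is made up for illustration; only the column names come from the question.

```r
# Toy data (an assumption): three identical rows followed by a different one.
referrals <- data.frame(`Client ID`        = c(1, 1, 1, 2),
                        `Provider SubCode` = c("a", "a", "a", "b"),
                        check.names = FALSE, stringsAsFactors = FALSE)

j <- 1
# nrow(referrals) is re-evaluated on every pass, so the shrinking
# data frame never leads to an index beyond the last row.
while (j < nrow(referrals)) {
  same <- referrals[j, "Client ID"] == referrals[j + 1, "Client ID"] &&
          referrals[j, "Provider SubCode"] == referrals[j + 1, "Provider SubCode"]
  if (same) {
    referrals <- referrals[-(j + 1), ]  # assign back, otherwise nothing changes
  } else {
    j <- j + 1                          # move on only when row j+1 was kept
  }
}
referrals  # the first and the last of the original rows remain
```

A `while` loop is used here because a `for` loop evaluates its range once, before any rows are dropped.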

2 Answers

You need to add a reproducible example for people to work with. I don't know the structure of your data, so I can only guess if this will work for you. I would not use a loop, for the reasons pointed out in the comments. I would identify the rows to remove first, and then remove them using normal means. Consider:

set.seed(4499)  # this makes the example exactly reproducible
d <- data.frame(Client.ID        = sample.int(4, 20, replace=TRUE),
                Provider.SubCode = sample.int(4, 20, replace=TRUE))
d
#    Client.ID Provider.SubCode
# 1          1                1
# 2          1                4
# 3          3                2
# 4          4                4
# 5          4                1
# 6          2                2
# 7          2                2  # redundant
# 8          3                1
# 9          4                4
# 10         3                4
# 11         1                3
# 12         1                3  # redundant
# 13         3                4
# 14         1                2
# 15         3                2
# 16         4                4
# 17         3                4
# 18         2                2
# 19         4                1
# 20         3                3
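# TRUE at position i when row i+1 matches row i on both columns
redundant.rows <- with(d, Client.ID[1:(nrow(d)-1)]==Client.ID[2:nrow(d)] &
                          Provider.SubCode[1:(nrow(d)-1)]==Provider.SubCode[2:nrow(d)] )
# logical index: still correct when no row is redundant
d[!c(FALSE, redundant.rows),]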
#    Client.ID Provider.SubCode
# 1          1                1
# 2          1                4
# 3          3                2
# 4          4                4
# 5          4                1
# 6          2                2
# 8          3                1  # 7 is missing
# 9          4                4
# 10         3                4
# 11         1                3
# 13         3                4  # 12 is missing
# 14         1                2
# 15         3                2
# 16         4                4
# 17         3                4
# 18         2                2
# 19         4                1
# 20         3                3
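
The same idea can be wrapped in a small function for a data frame whose column names contain spaces, as in the question. This is only a sketch: `drop_adjacent_repeats` is a hypothetical name, the toy data are made up, and it assumes the compared columns contain no missing values.

```r
# Hypothetical helper: drop every row that matches the row directly above it
# on all of the columns named in `cols`. Assumes no NA in those columns.
drop_adjacent_repeats <- function(df, cols) {
  n <- nrow(df)
  if (n < 2) return(df)                 # nothing to compare
  same <- rep(TRUE, n - 1)
  for (col in cols) {
    # compare rows 2..n with rows 1..(n-1), one column at a time
    same <- same & df[[col]][-1] == df[[col]][-n]
  }
  df[!c(FALSE, same), , drop = FALSE]   # the first row is always kept
}

# Toy data (an assumption), using the column names from the question.
referrals <- data.frame(`Client ID`        = c(1, 1, 2, 1),
                        `Provider SubCode` = c(3, 3, 3, 3),
                        check.names = FALSE)
deduped <- drop_adjacent_repeats(referrals, c("Client ID", "Provider SubCode"))
deduped  # row 2 is dropped; row 4 stays because it does not follow a match
```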
gung - Reinstate Monica

Based on the information you gave, I believe this could be a good alternative:

duplicated.rows <- duplicated(referrals)

Then, if you want the duplicated rows, run:

referrals.double <- referrals[duplicated.rows, ]

However, if you want the non-duplicated rows, run:

referrals.not.double <- referrals[!duplicated.rows, ]

If only these two columns should define a duplicate (the data frame may hold other columns), restrict the check to them:

duplicated.rows.pair <- duplicated(referrals[, c("Client ID", "Provider SubCode")])

referrals.not.double <- referrals[!duplicated.rows.pair, ]
Fábio
    This is similar to what I put. But it isn't clear that the two variables discussed are the only variables in the dataset, so it isn't clear that `duplicated()` will work for the OP. A reproducible example would help clarify things. – gung - Reinstate Monica Oct 22 '16 at 01:27
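
To illustrate the concern in the comment above, here is a small sketch on made-up data. `duplicated()` flags a repeat anywhere earlier in the data frame, while the comparison in the question only looks at the row directly above. The `Visit` column is hypothetical and stands in for whatever other variables the real data may contain.

```r
# Toy data (an assumption): rows 1, 3 and 4 share the same key pair,
# but only row 4 directly follows a matching row.
referrals <- data.frame(`Client ID`        = c(1, 2, 1, 1),
                        `Provider SubCode` = c("a", "b", "a", "a"),
                        Visit              = 1:4,   # hypothetical extra column
                        check.names = FALSE, stringsAsFactors = FALSE)

# No whole row repeats, because Visit differs everywhere.
any(duplicated(referrals))
# FALSE

# Restricting to the two key columns flags rows 3 and 4.
dup.anywhere <- duplicated(referrals[, c("Client ID", "Provider SubCode")])
dup.anywhere
# FALSE FALSE  TRUE  TRUE

# Comparing each row only with the one above it flags row 4 alone.
n <- nrow(referrals)
same.as.previous <- c(FALSE,
  referrals[["Client ID"]][-1] == referrals[["Client ID"]][-n] &
  referrals[["Provider SubCode"]][-1] == referrals[["Provider SubCode"]][-n])
same.as.previous
# FALSE FALSE FALSE  TRUE
```

Which of the two is wanted depends on whether non-adjacent repeats should also be dropped, something the question does not settle.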