Multirow deletion: delete row depending on other row

Question

I'm stuck with a quite complex problem. I have a data frame with three rows: id, info and rownum. The data looks like this:

id   info   row
 1      a     1
 1      b     2
 1      c     3
 2      a     4
 3      b     5
 3      a     6
 4      b     7
 4      c     8

What I want to do now is to delete all other rows of one id if one of the rows contains the info a. This would mean for example that row 2 and 3 should be removed as row 1's coloumn info contains the value a. Please note that the info values are not ordered (id 3/row 5 & 6) and cannot be ordered due to other data limitations.

I solved the case using a for loop:

# select all id containing an "a"-value 
a_val <- data$id[grep("a", data$info)]

# check for every id containing an "a"-value
for(i in a_val) {

   temp_data <- data[which(data$id == i),]

   # only go on if the given id contains more than one row
   if (nrow(temp_data) > 1) {

      for (ii in nrow(temp_data)) {

         if (temp_data$info[ii] != "a") {
            temp <- temp_data$row[ii]

            if (!exists("delete_rows")) {
               delete_rows <- temp
            } else {
               delete_rows <- c(delete_rows, temp)
            }
         }
      }
   }
}

My solution works quite well. Nevertheless, it is very, very, very slow as the original data contains more than 700k rows and more that 150k rows with an "a"-value.

I could use a foreach loop with 4 cores to speed it up, but maybe someone could give me a hint for a better solution.

Best regards,
Arne

[UPDATE]

The outcome should be:

id   info   row
 1      a     1
 2      a     4
 3      a     6
 4      b     7
 4      c     8

So if I understand correctly, in your example the only remaining rows are 7 and 8? Does something like `data[!(data$id %in% unique(data$id[data$info=="a"])),]` solve your problem? — sebastian-c, Nov 26 '12 at 11:29
You should provide a reproducible example http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example — Roman Luštrik, Nov 26 '12 at 11:34
@sebastian-c And the result should be that the only left rows are 1,4,6,7,8) — ahs85, Nov 26 '12 at 11:36
Ah hah, so all the ids with "a" except those only containing "a". — sebastian-c, Nov 26 '12 at 11:40

Sven Hohenstein · Accepted Answer · 2012-11-26T16:57:10.660

2

Here is one possible solution.

First find ids where info contains "a":

ids <- with(data, unique(id[info == "a"]))

Subset the data:

subset(data, (id %in% ids & info == "a") | !id %in% ids)

Output:

  id info row
1  1    a   1
4  2    a   4
6  3    a   6
7  4    b   7
8  4    c   8

An alternative solution (maybe harder to decipher):

subset(data, info == "a" | !rep.int(tapply(info, id, function(x) any(x == "a")),
                                    table(id)))

Note. @BenBarnes found out that this solution only works if the data frame is ordered according to id.

edited Nov 26 '12 at 16:57

answered Nov 26 '12 at 11:37

Sven Hohenstein

80,497
17
145
168

1

I don't think so. OP stated he wanted rows 1, 4, 6, 7, 8. – sebastian-c Nov 26 '12 at 11:39
@sebastian-c right, please look above to the updated orifinal post – ahs85 Nov 26 '12 at 11:42
@ahs85 I modified my answer. – Sven Hohenstein Nov 26 '12 at 12:32
Works like a charm and is **very** fast! – ahs85 Nov 26 '12 at 16:28
1

@SvenHohenstein, it looks like your alternative solution requires that `data` be ordered according to `id`. – BenBarnes Nov 26 '12 at 16:43
@BenBarnes + 1 Thank you very much for the hint. I added an comment to the alternative solution. – Sven Hohenstein Nov 26 '12 at 16:57

BenBarnes · Answer 2 · 2012-11-26T15:10:06.280

You might want to investigate the data.table package:

EDIT: If the row variable is not a sequential numbering of each row in your data (as I assumed it was), you could create such a variable to obtain the original row order:

library(data.table)
# Create data.table of your data
dt <- as.data.table(data)
# Create index to maintain row order
dt[, idx := seq_len(nrow(dt))]
# Set a key on id and info
setkeyv(dt, c("id", "info"))
# Determine unique ids
uid <- dt[, unique(id)]
# subset your data to select rows with "a"
dt2 <- dt[J(uid, "a"), nomatch = 0]
# identify rows of dataset where the id doesn't have an "a"
dt3 <- dt[J(dt2[, setdiff(uid, id)])]
# rbind those two data.tables together
(dt4 <- rbind(dt2, dt3))

#    id info row idx
# 1:  1    a   1   1
# 2:  2    a   4   4
# 3:  3    a   6   6
# 4:  4    b   7   7
# 5:  4    c   8   8

# And if you need the original ordering of rows,
dt5 <- dt4[order(idx)]

Note that setting a key for the data.table will order the rows according to the key columns. The last step (creating dt5) sets the row order back to the original.

+1 It feels like the new not-join syntax should be used (`DT[!"a"]`), but I didn't fully follow the 2 column aspect of the question after a quick glance. — Matt Dowle, Nov 26 '12 at 12:21
@MatthewDowle, good suggestion with not-join. I usually hang a bit behind the most up-to-date package version, so that option is still very new territory for me. I'll have a look, though. — BenBarnes, Nov 26 '12 at 13:22

score 1 · Answer 3 · answered Nov 26 '12 at 11:44

Here is a way using ddply:

df <- read.table(text="id   info   row
 1      a     1
 1      b     2
 1      c     3
 2      a     4
 3      b     5
 3      a     6
 4      b     7
 4      c     8",header=TRUE)


library("plyr")
ddply(df,.(id),subset,rep(!'a'%in%info,length(info))|info=='a')

Returns:

  id info row
1  1    a   1
2  2    a   4
3  3    a   6
4  4    b   7
5  4    c   8

score 0 · Answer 4 · answered Nov 26 '12 at 16:47

if df is this (RE Sacha above) use match which just finds the index of the first occurrence:

df <- read.table(text="id   info   row
 1      a     1
 1      b     2
 1      c     3
 2      a     4
 3      b     5
 3      a     6
 4      b     7
 4      c     8",header=TRUE)


# the first info row matching 'a' and all other rows that are not 'a'
with(df, df[c(match('a',info), which(info != 'a')),])

  id info row
1  1    a   1
2  1    b   2
3  1    c   3
5  3    b   5
7  4    b   7
8  4    c   8

score -1 · Answer 5 · answered Nov 26 '12 at 11:22

-1

try to take a look at subset, it's quite easy to use and it will solve your problem.

you just need to specify the value of the column that you want to subset based on, alternatively you can choose more columns.

http://stat.ethz.ch/R-manual/R-devel/library/base/html/subset.html

http://www.statmethods.net/management/subset.html

answered Nov 26 '12 at 11:22

ifreak

1,726
4
27
45

I tried subset but how can I get it to recognize rows on the lines above and/or below? – ahs85 Nov 26 '12 at 11:37
I didn't do it, but perhaps because this is more of a comment (in the lines of RTFM) and less of an answer. – A5C1D2H2I1M1N2O1R2T1 Nov 26 '12 at 11:42

Multirow deletion: delete row depending on other row

5 Answers5