Subsetting data.table based on repeated rows

Question

I have a list of data tables stored in an object ddf (a sample is shown below):

 [[43]]
    V1 V2 V3
1:  b  c  a
2:  b  c  a
3:  b  c  a
4:  b  c  a
5:  b  b  a
6:  b  c  a
7:  b  c  a

 [[44]]
   V1 V2 V3
1:  a  c  a
2:  a  c  a
3:  a  c  a
4:  a  c  a
5:  a  c  a

 [[45]]
   V1 V2 V3
1:  a  c  b
2:  a  c  b
3:  a  c  b
4:  a  c  b
5:  a  c  b
6:  a  c  b
7:  a  c  b
8:  a  c  b
9:  a  c  b
               .............and so on till [[100]]

I want to subset the list ddf so that the result contains only those data tables which:

have at least 9 rows each
have all of those 9 rows identical

I want to store this subsetted output.

I have written some code for this below:

 for(i in 1:100){
 m=(as.numeric(nrow(df[[i]]))>= 9)
 if(m == TRUE & df[[i]][1,] = df[[i]][2,] = 
 =df[[i]][3,] =df[[i]][4,] =df[[i]][5,] =df[[i]][6,]=
 df[[i]][7,]=df[[i]][8,]=df[[i]][9,]){
 print(df[[i]])
 }}

Please tell me what's wrong and how I can generalize the result to subset based on "n" similar rows.

[Follow-up Question]

    Answer obtained from Main question:
    > ddf[sapply(ddf, function(x) nrow(x) >= n & nrow(unique(x)) == 1)]
      $`61`
         V1 V2 V3
      1:  a  c  b
      2:  a  c  b
      3:  a  c  b
      4:  a  c  b
      5:  a  c  b
      6:  a  c  b
      7:  a  c  b

      $`68`
         V1 V2 V3
      1:  a  c  a
      2:  a  c  a
      3:  a  c  a
      4:  a  c  a
      5:  a  c  a
      6:  a  c  a
      7:  a  c  a
      8:  a  c  a

      $`91`
         V1 V2 V3
      1:  b  c  a
      2:  b  c  a
      3:  b  c  a
      4:  b  c  a
      5:  b  c  a
      6:  b  c  a
      7:  b  c  a

               ..... till the last data.frame which meet the row matching criteria (of at least 9 similar rows)

      There are only 2 types of elements in the list: 
                  **[[.. ]]**        
     **Case 1.** >70% accuracy       
     **Case 2.** <70% accuracy

You will notice that the output shown above in the "Follow-up Question" is for $`61`, $`68` & $`91`, but there is no output for the other data tables, which don't meet the "matching row" criteria.

I need an output in which the elements that don't meet the criteria give "bad output" instead.

Thus the final list should be the same length as the input list.

By placing them side-by-side using paste I should be able to see each output.

Please read the info on how to give a [reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example/5963610) — Jaap, Mar 23 '17 at 08:41
Ideally, we would be able to copy/paste all of your code into R and start tinkering. The data you provide isn't all that suited for import. See the link Jaap provided on how to effortlessly share (simulated) data. — Roman Luštrik, Mar 23 '17 at 08:47

akrun · Accepted Answer · 2017-03-23T09:03:16.757

2

We can loop through the list ('ddf') and keep only the duplicated rows of each dataset (with duplicated). If the resulting dataset 'x1' has more than 8 rows, we order it and return the first 9 rows (head(.SD, 9)); otherwise we return "bad answer".

lapply(ddf, function(x) {
  # keep every row that has at least one identical copy
  x1 <- x[duplicated(x) | duplicated(x, fromLast = TRUE)]
  if (nrow(x1) >= 9) {
    x1[order(V1, V2, V3), head(.SD, 9)]
  } else "bad answer"
})
#[[1]]
#   V1 V2 V3
#1:  b  c  a
#2:  b  c  a
#3:  b  c  a
#4:  b  c  a
#5:  b  c  a
#6:  b  c  a
#7:  b  c  a
#8:  b  c  a
#9:  b  c  a

#[[2]]
#[1] "bad answer"

#[[3]]
#[1] "bad answer"

data

library(data.table)

ddf <- list(data.table(V1 = 'b', V2 = rep(c('c', 'b', 'c'), c(8, 1, 2)), V3 = 'a'),
            data.table(V1 = rep("a", 5), V2 = rep("c", 5), V3 = rep("a", 5)),
            data.table(V1 = c('b', 'a', 'b', 'b'), V2 = c('b', 'a', 'c', 'b'),
                       V3 = c("c", "d", "a", "b")))
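
When a table holds more than one set of repeated rows, taking the first 9 rows after ordering can mix the sets. A minimal sketch of an alternative, assuming it is enough to report each distinct row together with how often it occurs: count the occurrences per distinct row and keep the rows that occur at least n times. The helper name repeated_rows and the table two_sets are hypothetical and not part of the original answer.

```r
library(data.table)

# Hypothetical helper: return the distinct rows of `x` that occur at least
# `n` times, together with their counts; "bad answer" if there are none.
repeated_rows <- function(x, n = 9) {
  counts <- x[, .N, by = names(x)]  # one row per distinct combination, N = occurrences
  keep <- counts[N >= n]
  if (nrow(keep) > 0) keep else "bad answer"
}

# Illustrative table with two repeated sets: 10 x (b, c, a) and 9 x (a, c, b)
two_sets <- data.table(V1 = rep(c("b", "a"), c(10, 9)),
                       V2 = "c",
                       V3 = rep(c("a", "b"), c(10, 9)))

res <- repeated_rows(two_sets, n = 9)
res
```

The result has one row per qualifying set, so two sets of repeats stay separate instead of being truncated together by head().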

edited Mar 23 '17 at 09:03

answered Mar 23 '17 at 08:50


I get a function error: 2nd argument must be a list – rtaero Mar 23 '17 at 08:59
function (what, args, quote = FALSE, envir = parent.frame()) { if (!is.list(args)) stop("second argument must be a list") if (quote) args <- lapply(args, enquote) .Internal(do.call(what, args, envir)) } – rtaero Mar 23 '17 at 08:59
@Rishi Corrected the problem – akrun Mar 23 '17 at 09:03
@arun I'm getting an incorrect answer for the subsetting, I am getting one data.table outputs when I should be getting two – rtaero Mar 23 '17 at 09:13
@Rishi I added an example data in my post and is working fine with it – akrun Mar 23 '17 at 09:13
Yes @akrun Thank You – rtaero Mar 23 '17 at 09:42
there is an issue: If the data has 2 sets of repeated values then I get an incorrect result. – rtaero Mar 24 '17 at 02:40
@RishiTandon My logic was that there would be many repeat values and it selects the first one after ordering. – akrun Mar 24 '17 at 02:42
How can I coerce this output: ddf[sapply(ddf, nrow) >= 9 & sapply(ddf, function(x) nrow(unique(x))) == 1] with a loop :: for the errors I get "bad result" – rtaero Mar 24 '17 at 04:52
@rtaero I think your question should be directed towards the other answer which uses that code – akrun Mar 24 '17 at 05:29
the loop is more important for my case, please share a code for filtering the data.frame's in the list 'ddf' based on a minimum of "n" similar, unique rows. – rtaero Mar 27 '17 at 05:10
@rtaero I think what you may need is `lapply(ddf, function(x) if(nrow(x) >= n & nrow(unique(x)) == 1) x else "Bad output")` – akrun Mar 27 '17 at 06:49
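
The suggestion in the last comment can be sketched as follows, assuming n is the minimum number of identical rows and that a plain marker string is acceptable for tables that fail the check. The example list is made up for illustration. Because lapply returns one element per input element, the result has the same length as the input list.

```r
library(data.table)

n <- 9  # minimum number of identical rows (assumed value)

# Illustrative list: only the second table has 9 identical rows
ddf <- list(
  data.table(V1 = "b", V2 = rep(c("c", "b", "c"), c(8, 1, 2)), V3 = "a"),
  data.table(V1 = rep("a", 9), V2 = "c", V3 = "b"),
  data.table(V1 = rep("a", 5), V2 = "c", V3 = "a")
)

# Return the table when it qualifies, a marker string otherwise
res <- lapply(ddf, function(x) {
  if (nrow(x) >= n && uniqueN(x) == 1) x else "Bad output"
})

length(res) == length(ddf)
```

Inside if(), `&&` is used rather than `&` because the condition is a single TRUE/FALSE value and the second check is skipped when the table is too short.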

Jaap · Answer 2 · 2017-11-04T15:44:46.077

2

When ddf is your list of datatables, then:

ddf[sapply(ddf, nrow) >= 9 & sapply(ddf, function(x) nrow(unique(x))) == 1]

should give you the desired result.

Where:

sapply(ddf, nrow) >= 9 checks whether the datatables have nine or more rows
sapply(ddf, function(x) nrow(unique(x))) == 1 checks whether all the rows are the same.
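
To see how the two checks combine, here is a small sketch on a made-up list of three tables (the data and the names enough_rows and all_same are illustrative only):

```r
library(data.table)

ddf <- list(
  data.table(V1 = "b", V2 = rep(c("c", "b", "c"), c(4, 1, 2)), V3 = "a"),  # 7 rows, not all equal
  data.table(V1 = rep("a", 5), V2 = "c", V3 = "a"),                        # 5 identical rows
  data.table(V1 = rep("a", 9), V2 = "c", V3 = "b")                         # 9 identical rows
)

enough_rows <- sapply(ddf, nrow) >= 9                          # FALSE FALSE TRUE
all_same    <- sapply(ddf, function(x) nrow(unique(x))) == 1   # FALSE TRUE  TRUE

# Only tables passing both checks are kept: here, the third one
res <- ddf[enough_rows & all_same]
```

Here `&` is the right operator, since both sides are logical vectors with one element per table.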

Or with one sapply call as @docendodiscimus suggested:

ddf[sapply(ddf, function(x) nrow(x) >= 9 & nrow(unique(x)) == 1)]

Or by using the .N special symbol and the uniqueN function of data.table:

ddf[sapply(ddf, function(x) x[,.N] >= 9 & uniqueN(x) == 1)]

Another option is to use Filter (following the suggestion of @Frank in the comments):

Filter(function(x) nrow(x) >= 9 & uniqueN(x) == 1, ddf)
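
To generalize from 9 to "n" rows, the Filter call could be wrapped in a small function. This is only a sketch: keep_uniform is a hypothetical name, and the example list is made up.

```r
library(data.table)

# Hypothetical helper: keep the tables that have at least `n` rows,
# all of which are identical
keep_uniform <- function(lst, n = 9) {
  Filter(function(x) nrow(x) >= n && uniqueN(x) == 1, lst)
}

ddf <- list(
  data.table(V1 = rep("a", 5), V2 = "c", V3 = "a"),  # 5 identical rows
  data.table(V1 = rep("a", 9), V2 = "c", V3 = "b"),  # 9 identical rows
  data.table(V1 = c("a", "b"), V2 = "c", V3 = "a")   # 2 different rows
)

res5 <- keep_uniform(ddf, n = 5)  # keeps the first two tables
res9 <- keep_uniform(ddf)         # default n = 9 keeps only the second table
```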

Two approaches to get the datatable numbers:

1. Using which:

which(sapply(ddf, function(x) nrow(x) >= 9 & nrow(unique(x)) == 1))

2. Assign names to the datatables in the list:

names(ddf) <- paste0('dt', 1:length(ddf))

Now the output will include the name (and thus the number) of each datatable:

$dt4
  V1 V2 V3
1  a  c  b
2  a  c  b
3  a  c  b
4  a  c  b
5  a  c  b
6  a  c  b
7  a  c  b
8  a  c  b
9  a  c  b
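
Both approaches can be combined. A short sketch on a made-up list (the variable names ok, idx and res are illustrative): once the list is named, the names carry through both which and the subsetting.

```r
library(data.table)

ddf <- list(
  data.table(V1 = "b", V2 = rep(c("c", "b", "c"), c(4, 1, 2)), V3 = "a"),
  data.table(V1 = rep("a", 5), V2 = "c", V3 = "a"),
  data.table(V1 = rep("a", 9), V2 = "c", V3 = "b")
)
names(ddf) <- paste0("dt", seq_along(ddf))

# Named logical vector: one TRUE/FALSE per datatable
ok <- sapply(ddf, function(x) nrow(x) >= 9 && uniqueN(x) == 1)

idx <- which(ok)  # positions of the qualifying tables, labelled with their names
res <- ddf[ok]    # the qualifying tables, still carrying their names
names(res)
```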

edited Nov 04 '17 at 15:44

answered Mar 23 '17 at 08:50


Can't you do both check in one `sapply` call? – talat Mar 23 '17 at 09:01
@Jaap - I need the datatable number to appear in the output .. e.g. lst[[11]] ..., lst[[15]], assuming they meet the criteria of subsetting – rtaero Mar 23 '17 at 09:12
@Jaap - I need an output as a list :: for an inaccurate result I get "< 70% accuracy". For the correct output I get the datatable. This may need a for-loop – rtaero Mar 24 '17 at 03:07
@rtaero The methods I showed are returning a list and giving the correct output. Please explain your problem in more detail. You could do that by adding an extra section to your question where you explain this. – Jaap Mar 24 '17 at 07:20
@Jaap I have added "[Follow-up Question]" in the main Question, for the benefit of the community – rtaero Mar 24 '17 at 08:09
@rtaero please [see here for instructions on how to make the question reproducible](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example/5963610) (now it isn't); you could for example include the output of `dput(head(name_of_list))` – Jaap Mar 24 '17 at 08:24
@Jaap Please help me answer this extended part of the question, have included changes to code based on suggestions – rtaero Mar 25 '17 at 09:52
Sir, would really appreciate your help in this @Jaap, I have tried many combinations of loops and tried to run this, but it doesn't work – rtaero Mar 27 '17 at 04:39
@rtaero for the output you need for the followup question, I think akrun already provided you with a good solution – Jaap Mar 27 '17 at 06:07
Need to clarify the answer for the benefit of the community – AzizSM Nov 03 '17 at 07:50
@Azi Would you care to explain the downvote (since it came shortly after my last comment, I guess it was you)? At the beginning of my answer I've explained what the two parts do. If you think it could be improved, please explain how. – Jaap Nov 04 '17 at 15:47
