I am trying to isolate entries in a data frame that share common values. The code below reconstructs a portion of my data frame:

Stand <- c("MY", "MY", "MY", "MY", "MY")
Plot <- c(12, 12, 12, 12, 12)
StumpNumber <- c(1, 2, 3, 3, 7)
TreeNumber <- c(1, 2, 3, 4, 8)
sample <- data.frame(Stand, Plot, StumpNumber, TreeNumber)
sample

I would like output that tells me which entries have common values. In other words, I want to quickly isolate situations where there is more than one TreeNumber (that is, more than one row) for a given Stand, Plot, StumpNumber combination. In the example data, StumpNumber 3 has both TreeNumber 3 and TreeNumber 4.

My understanding of duplicated() is that it can find instances where duplicated values occur within a single column. What can I do to find situations where a common combination of columns occurs?

Thanks.

Joshua Ulrich
Nan
  • No, what's unique is the combination of Stand, Plot, StumpNumber, and TreeNumber. I'm a forester and what I'm looking at is situations where multiple trees come from one stump (i.e., trees growing in a clump). – Nan Nov 10 '10 at 04:00
  • I wasn't clear. I was asking if TreeNumber is unique _within_ Stand, Plot, StumpNumber. – Joshua Ulrich Nov 10 '10 at 14:33

2 Answers

The Description of ?duplicated indicates that it works on rows of data.frames and the fourth paragraph of the Details section says:

 The data frame method works by pasting together a character
 representation of the rows separated by ‘\r’, so may be imperfect
 if the data frame has characters with embedded carriage returns or
 columns which do not reliably map to characters.

How did you come to understand that it only works on single columns?

Assuming TreeNumber is unique within Stand, Plot, and StumpNumber, you just need to exclude it from the call to duplicated().

> duplicated(sample[,1:3])
[1] FALSE FALSE FALSE  TRUE FALSE
> duplicated(sample[,1:3], fromLast=TRUE)
[1] FALSE FALSE  TRUE FALSE FALSE

Update - If you would like all the duplicated rows, you could do something like:

> allDups <- duplicated(sample[,1:3],fromLast=TRUE) | duplicated(sample[,1:3])
> sample[allDups,]
  Stand Plot StumpNumber TreeNumber
3    MY   12           3          3
4    MY   12           3          4
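
The same selection can be written against column names instead of column positions, which keeps working if the column order changes. This is only a sketch: the data from the question are rebuilt under the hypothetical name `trees` (to avoid masking base::sample), and `key_cols` and `all_dups` are names introduced here for illustration.

```r
# Rebuild the example data from the question under a different name
trees <- data.frame(
  Stand       = c("MY", "MY", "MY", "MY", "MY"),
  Plot        = c(12, 12, 12, 12, 12),
  StumpNumber = c(1, 2, 3, 3, 7),
  TreeNumber  = c(1, 2, 3, 4, 8)
)

# Columns that together identify a stump (assumed grouping key)
key_cols <- c("Stand", "Plot", "StumpNumber")

# Flag a row if an identical key appears earlier OR later in the data
all_dups <- duplicated(trees[, key_cols]) |
  duplicated(trees[, key_cols], fromLast = TRUE)

# Every row that belongs to a repeated Stand/Plot/StumpNumber key
trees[all_dups, ]
```

Combining the forward and `fromLast = TRUE` passes is what picks up the first occurrence as well as the later ones.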
Joshua Ulrich

For convenience, I'm going to assume you have a nesting scheme going on: Trees are nested in Stumps, Stumps in Plots, and Plots in Stands. I also assume the problem you're trying to solve is that some trees are attached to the same stump, which means the problematic entries are those where the Stand/Plot/Stump identifiers are repeated for different TreeNumbers.

What I did was:

  • Order the data
  • Wrap a slightly customized function around duplicated()
  • Use ddply() (in the plyr package) to split and analyze your data
  • Print out the problematic entries

Ordering the Data

I ordered first by Stand, then Plot, and finally StumpNumber

    sampleOrdered <- sample[order(sample$Stand, sample$Plot, sample$StumpNumber), ]

Wrapping my own duplicated() function

Assuming the issue is that some trees are attached to the same stump, we can write the following function:

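    # Return the rows whose StumpNumber already appeared earlier in `data`
    findTreesAttachedToTheSameStump <- function(data) {
        x <- duplicated(data[ , "StumpNumber"])  # TRUE for the second and later occurrences
        data[x, ]
    }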

This function will select out and return (implicitly) whatever entries pass the duplicated() test.
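
To see what the function returns, it can be called directly on the example data, which contain a single Stand/Plot combination. This is a sketch only: the data frame is rebuilt here under the hypothetical name `trees`, and `repeats` is a name introduced for illustration.

```r
# Example data from the question, rebuilt under a different name
trees <- data.frame(
  Stand       = c("MY", "MY", "MY", "MY", "MY"),
  Plot        = c(12, 12, 12, 12, 12),
  StumpNumber = c(1, 2, 3, 3, 7),
  TreeNumber  = c(1, 2, 3, 4, 8)
)

# Same logic as the function defined in this answer
findTreesAttachedToTheSameStump <- function(data) {
  x <- duplicated(data[, "StumpNumber"])
  data[x, ]
}

# Only the later occurrence of StumpNumber 3 is flagged
repeats <- findTreesAttachedToTheSameStump(trees)
repeats
```

Because duplicated() marks only the second and later occurrences, the first tree on a shared stump is not included in the result.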

Using ddply

I did a bit of split-apply-combine here. I instruct ddply to break the dataset by Stand and Plot (since your data is nested, and StumpNumber might only be unique within a plot). Then, I apply the function we created above:

    library(plyr)  # provides ddply() and .()
    sampleDuplicated <- ddply(sampleOrdered, .(Stand, Plot), findTreesAttachedToTheSameStump)

Print out the problematic stumps

Now all we need to do is print sampleDuplicated, which contains the repeated entries (the second and any later rows) for every Stand/Plot/StumpNumber combination that occurs more than once.
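
If plyr is not available, a similar split-apply-combine can be sketched in base R by counting the rows in each Stand/Plot/StumpNumber group with ave(). This is an alternative technique, not the approach used above; `trees`, `group_size`, and `clumps` are hypothetical names introduced for the example.

```r
# Example data from the question, rebuilt under a different name
trees <- data.frame(
  Stand       = c("MY", "MY", "MY", "MY", "MY"),
  Plot        = c(12, 12, 12, 12, 12),
  StumpNumber = c(1, 2, 3, 3, 7),
  TreeNumber  = c(1, 2, 3, 4, 8)
)

# For each row, the number of rows sharing its Stand/Plot/StumpNumber key
group_size <- ave(
  seq_len(nrow(trees)),
  trees$Stand, trees$Plot, trees$StumpNumber,
  FUN = length
)

# Keep every row that belongs to a group with more than one tree
clumps <- trees[group_size > 1, ]
clumps
```

Unlike a single duplicated() call, this keeps all trees on a shared stump, including the first one.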

briandk
  • ...or you could use `sample[duplicated(sample[,1:3]),]` – Joshua Ulrich Nov 10 '10 at 22:57
  • @Joshua - Do you mean that would work in place of my solution? I tried just running your line, but I get this output: `<0 rows> (or 0-length row.names)`. Can you clarify your suggestion? – briandk Nov 11 '10 at 00:03
  • hmm, I'm not sure what to tell you. It "works for me" (tm) in a fresh session on a fresh install of R-2.12.0. I just run the 5 lines from Nan's question and the one line from my comment to get the same results as your answer. – Joshua Ulrich Nov 11 '10 at 01:24