Partitioning Data in 'R' based on data size

Question

I'm currently working on a program that analyzes leaf area and compares it to the position of the leaf within its cluster (i.e. whether it is the first leaf, the 3rd, the last, etc.), and I am analyzing the relationship between position, area, mass, and more. I have a database of approximately 5,000 leaves and 1,000 clusters, and that's where the problem arises.

Clusters vary in size: most have 5 leaves, but some have 2, 8, or anywhere in between. I need a way to separate the clusters by the number of leaves in each cluster so that the program isn't treating clusters with 3 leaves the same as clusters with 7. My .csv has each leaf entered individually, so manually entering separate sets isn't possible.

I'm rather new to 'R', so I might be missing an obvious skill here, but any help would be greatly appreciated. I also understand this is rather confusing, so please feel free to reply with clarifying questions.

Thanks in advance.

I mean, I can provide it but it doesn't have much of anything to do as my current project doesn't subset the data. I just need a way to subset the data, something that I'm not doing at all so far. — Blair Armstrong, Nov 13 '17 at 22:34

score 0 · Answer 1 · answered Nov 13 '17 at 23:17

If I understand the question correctly, it sounds like you want to calculate things based on some defined group (in your case clusterPosition?). One way to do this with dplyr is to use group_by with summarize or mutate. The latter keeps all the rows in your original data set and simply adds a new column to it; the former aggregates like rows and returns a summary statistic for each unique value of the grouping variable.

As an example, if your data looks something like this:

df <- data.frame(leafArea = c(2.0, 3.0, 4.0, 5.0, 6.0), cluster = c(1, 2, 1, 2, 3), clusterPosition = c(1, 1, 2, 2, 1))

To get the mean and standard deviation for each unique clusterPosition, you would do something like the below. This returns one row for each unique clusterPosition.

library(dplyr)
df %>% group_by(clusterPosition) %>% summarize(meanArea = mean(leafArea), sdArea = sd(leafArea))

If you want to compare each unique leaf to some characteristic of its clusterPosition, i.e. you want to preserve all the individual rows in your original dataset, you can use mutate instead of summarize.

library(dplyr)
df %>% group_by(clusterPosition) %>% mutate(meanPositionArea = mean(leafArea), diffMean = leafArea - meanPositionArea)
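The question also asks about keeping clusters of different sizes apart. The same group_by approach can probably cover that: count the leaves in each cluster, store the count as a new column, and then group or split on it. This is only a sketch under assumptions: it reuses the toy df and column names from the example above (not the asker's real columns), and clusterSize, by_size and size_list are names made up for illustration.

```r
library(dplyr)

# Toy data with the same (assumed) column names as the example above
df <- data.frame(
  leafArea = c(2.0, 3.0, 4.0, 5.0, 6.0),
  cluster = c(1, 2, 1, 2, 3),
  clusterPosition = c(1, 1, 2, 2, 1)
)

# Add the number of leaves in each cluster as a new column (hypothetical name)
df_sized <- df %>%
  group_by(cluster) %>%
  mutate(clusterSize = n()) %>%
  ungroup()

# Summarize leaf area separately for each cluster size and position,
# so 3-leaf clusters are never pooled with 7-leaf clusters
by_size <- df_sized %>%
  group_by(clusterSize, clusterPosition) %>%
  summarize(meanArea = mean(leafArea), nLeaves = n(), .groups = "drop")

# Alternatively, split into a list of data frames, one per cluster size
size_list <- split(df_sized, df_sized$clusterSize)
```

Keeping clusterSize as a column lets one analysis run across all sizes at once, while size_list[["5"]] (on real data) would give just the 5-leaf clusters if a separate analysis per size is preferred.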
