
I am using tapply to aggregate a table by sample ID (SID). The first sample in the list has 3 measurements, but the result reports only one.

I need 4 things passed to the new table: first, the SID; second, the mean of the areas of all measurements with that SID; third, the mean of the distances; and finally, the number of measurements.

cases_iTLS <- data.frame(unique(iTLS$SID))
colnames(cases_iTLS)[colnames(cases_iTLS)=="unique.iTLS.SID."] <- "SID"
cases_iTLS$SID <- factor(cases_iTLS$SID)

# Average of TLS on one slide for area
cases_iTLS$Area_iTLS <- tapply(iTLS$Area, iTLS$SID,FUN=mean) 

# Average of TLS on one slide for distance
cases_iTLS$Distance_iTLS <- tapply(iTLS$Distance, iTLS$SID,FUN=mean) 

# Number of measurements per SID
cases_iTLS$Count_iTLS <- tapply(iTLS$Region_Index, iTLS$SID,FUN=length) 


SID       Region_index   Area         Distance    Type    Location
112906    1              53531.53     71.982      iTLS    intratumoral
112906    3              76809.61     97.384      iTLS    intratumoral
112906    5              40937.30     9.643       iTLS    intratumoral
112947    1              35071.66     2.067       iTLS    intratumoral
112947    3              17979.88     36.319      iTLS
Andre
  • What is your question? What is the error or undesired result of your code? – Parfait Jul 24 '19 at 14:15
  • The output for the first sample is Count_iTLS = 1, but the input has 3 rows with unique Region_index for it. The desired output is Count_iTLS = 3. Additionally, the two other tapply calls are giving incorrect means. – Andre Jul 24 '19 at 14:17
  • Without an example of the structure or the data, this is difficult to answer. Is the mean wrong because of how `NA` values are handled? See [How to pass na.rm as argument to tapply?](https://stackoverflow.com/a/26644583/10489562) – phili_b Jul 24 '19 at 14:26
  • @phili_b I have added the structure of the data to the main question. There are no cells in the data that are NA. – Andre Jul 24 '19 at 14:31
  • Posting the output of `dput(myvariable)` here would make the structure easier to test :) – phili_b Jul 24 '19 at 14:37

1 Answer


Because you need to run separate aggregate functions (mean and length) across multiple columns (Area, Distance, and Region_Index) grouped by SID, consider using aggregate, which performs grouped aggregation and returns a data frame.

Usually, tapply applies a single function to a single numeric column and returns one named, atomic vector; it does not work across several columns or functions at once. The code below uses do.call with data.frame to flatten the nested result of the multiple aggregations.

aggregate

# AGGREGATE ACROSS COLS AND FUNCS
cases_iTLS <- aggregate(cbind(Area, Distance, Region_Index) ~ SID, iTLS, 
                        function(x) c(mean = mean(x), count = length(x)))

# BIND NESTED, UNDERLYING RESULTS
cases_iTLS <- do.call(data.frame, cases_iTLS)

# KEEP NEEDED COLUMNS
cases_iTLS <- cases_iTLS[c("SID", "Area.mean", "Distance.mean", "Region_Index.count")]
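
To make the "bind nested results" step more concrete, here is a minimal sketch on a toy data frame. The name `toy` and all of its values are made up for illustration and are not the asker's data. It shows that aggregate stores each summarised column as a two-column matrix, and that do.call with data.frame flattens those matrices into ordinary columns with period-qualified names.

```r
# Toy data with made-up values (not the asker's real table)
toy <- data.frame(
  SID          = c(2, 2, 2, 1, 1),
  Region_Index = c(1, 3, 5, 1, 3),
  Area         = c(10, 20, 30, 40, 60),
  Distance     = c(1, 2, 3, 4, 6)
)

agg <- aggregate(cbind(Area, Distance, Region_Index) ~ SID, toy,
                 function(x) c(mean = mean(x), count = length(x)))

# Each summarised column is a matrix with columns "mean" and "count",
# so the data frame itself has only 4 columns
ncol(agg)

# Flattening gives one plain column per column/statistic pair
flat <- do.call(data.frame, agg)
names(flat)
```

After flattening, the columns can be selected by name, which is what the "keep needed columns" step relies on.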

tapply

Should you want to go the tapply route, consider building a matrix of your separate aggregations with rbind and transpose t:

cases_iTL_mat <- with(iTLS, 
                         t(rbind(Area_mean = tapply(Area, SID, FUN=mean) ,
                                 Distance_mean = tapply(Distance, SID, FUN=mean),
                                 Region_count = tapply(Region_Index, SID, FUN=length)
                          ))
                 )
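
One possible cause of the mismatch in the original code, assuming the SIDs in iTLS do not already appear in sorted order: tapply returns its groups in sorted level order, while unique returns values in order of appearance, so assigning one onto the other by position can pair a result with the wrong SID. The sketch below uses a toy data frame with made-up values (`toy`, `cases`, `Count_wrong`, and `Count_right` are illustrative names) and shows a lookup by name instead of by position.

```r
# Toy data with made-up values; SID 2 appears before SID 1
toy <- data.frame(
  SID  = c(2, 2, 2, 1, 1),
  Area = c(10, 20, 30, 40, 60)
)

counts <- tapply(toy$Area, toy$SID, FUN = length)
names(counts)     # groups in sorted order
unique(toy$SID)   # values in order of appearance

cases <- data.frame(SID = unique(toy$SID))

# Positional assignment pairs SID 2 with the count that belongs to SID 1
cases$Count_wrong <- as.vector(counts)

# Looking up by name keeps each count with its own SID
cases$Count_right <- as.vector(counts[as.character(cases$SID)])
```

Building the whole table from the tapply results, as in the matrix above, avoids the problem because the row names carry the SIDs along.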

by

And I would be remiss not to point out by (the object-oriented wrapper to tapply):

cases_iTL_mat <- do.call(rbind, 
        by(iTLS, iTLS$SID, function(sub) {
               c(Area_mean = mean(sub$Area),
                 Distance_mean = mean(sub$Distance),
                 Region_count = length(sub$Region_Index))
          })
)
Parfait
  • This seems to work. I just had a question about the "Keep Needed Columns" part. Am I required to use those names to call the function, or is that just renaming the columns? – Andre Jul 24 '19 at 14:38
  • Period-qualifying names derive from the aggregate call on columns but can be renamed afterwards or within the `c()` function. Specifically, inside `aggregate`, `c(Andre_func = mean(x))` renders at the end as `Area.Andre_func`. – Parfait Jul 24 '19 at 14:44
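
As a small illustration of the naming behaviour described in the last comment, the sketch below uses a toy data frame with made-up values. The labels `avg` and `n` and the final name `Area_iTLS` are chosen only for the example. The labels given inside `c()` become the suffixes of the flattened column names, and any column can still be renamed afterwards.

```r
# Toy data with made-up values
toy <- data.frame(SID = c(2, 2, 2, 1, 1), Area = c(10, 20, 30, 40, 60))

# The labels used inside c() become the column-name suffixes
agg <- do.call(data.frame,
               aggregate(Area ~ SID, toy,
                         function(x) c(avg = mean(x), n = length(x))))
names(agg)   # "SID" "Area.avg" "Area.n"

# Renaming afterwards is ordinary column renaming
names(agg)[names(agg) == "Area.avg"] <- "Area_iTLS"
names(agg)   # "SID" "Area_iTLS" "Area.n"
```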