
This is a very large dataset, and I'm trying to get away from writing for loops in R. I'm looking for a way to tackle something I would usually do with a nested loop.

For each unique value in the confidence column, I need the indices of all rows whose confidence matches that value. For example, the first value (50) would return rows 1, 7, and 9. Then, using those indices, I want to average the corresponding values in the seqs column. Here, 50 would give 1980, 7357, and 3008, which are then averaged. The intended output is a data frame with two columns: one listing the unique confidence values and one with the average # seqs for each of those values.

input

#seqs       confidence
1980        50
1088        52
1099        52
2000        42
7009        45
1092        48
7357        50
5909        42
3008        50

output

ave.#seqs     confidence
4115          50
1093.5        52 
3954.5        42...
MycoP

1 Answer


Given that it's a "very large dataset", I suggest a data.table solution.

> library(data.table)
> setDT(data)[, mean(seqs), by=confidence]
   confidence     V1
1:         50 4115.0
2:         52 1093.5
3:         42 3954.5
4:         45 7009.0
5:         48 1092.0

Solutions using dplyr functions or base R's aggregate would also work, but they tend to be slower on very large data.
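
For comparison, here is a rough sketch of the base R aggregate route, rebuilt from the sample data in the question. It assumes the first column is named seqs (the `#seqs` header in the question is not a syntactic R name), and the ave_seqs column name is just an illustrative choice.

```r
# Example data copied from the question; the column name `seqs` is an
# assumption, since `#seqs` is not a syntactic R name.
data <- data.frame(
  seqs       = c(1980, 1088, 1099, 2000, 7009, 1092, 7357, 5909, 3008),
  confidence = c(50, 52, 52, 42, 45, 48, 50, 42, 50)
)

# Mean of seqs within each confidence group, with no explicit loop.
# aggregate() returns the groups sorted by confidence, not in order of
# first appearance.
result <- aggregate(seqs ~ confidence, data = data, FUN = mean)

# Rename the averaged column (hypothetical name, chosen for readability).
names(result)[names(result) == "seqs"] <- "ave_seqs"

print(result)
```

Unlike the data.table output above, the rows come back ordered by confidence, so sort or match on confidence if the original order matters.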

Yannis Vassiliadis