I have a huge dataset. Each row contains a phylum name, one measurement (M8), and a file name. There are 10 unique phylum names and 170 unique file names.
The goal is to calculate the relative abundance and the mean M8 of each phylum for each file. I know how to find the mean of M8, but I cannot figure out how to calculate the relative abundance at the same time. To be clear, to find the relative abundance of Actinobacteria in file x:
Z = number of times there is an entry for file x in the dataset
K = number of times there is an entry for *Actinobacteria* associated with file x
Relative abundance = K/Z.
I created a small dataset by randomly selecting 20 rows:
Phylum M8 Filename
Crenarchaeota 60.53 4440041.3
Proteobacteria 44.34 4440059.3
Proteobacteria 58.59 4440319.3
Firmicutes 21.49 4440368.3
Proteobacteria 50.96 4440419.3
Firmicutes 37.27 4447102.3
Actinobacteria 70.11 4461011.3
Actinobacteria 64.11 4461140.3
Actinobacteria 54.33 4461152.3
Actinobacteria 68.06 4461158.3
Firmicutes 58.95 4461168.3
Firmicutes 38.81 4461186.3
Proteobacteria 58.0 4461199.3
Actinobacteria 58.73 4461210.3
Firmicutes 44.59 4461211.3
Euryarchaeota 45.56 4461229.3
Euryarchaeota 58.0 4477874.3
Proteobacteria 62.0 4477874.3
Proteobacteria 57.0 4477874.3
Proteobacteria 56.0 4477874.3
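I find the mean of M8 for each Phylum and Filename:
library(plyr)
# fileName holds the path to the space-separated data file
myDF <- read.csv(fileName, header = TRUE, sep = ' ')
# keep file names as text so values like 4440041.3 are not treated as numbers
myDF$Filename <- as.character(myDF$Filename)
# one row per Phylum/Filename pair, with the mean of M8
myDF.mean <- ddply(myDF, .(Phylum, Filename), summarize, meanM8 = mean(M8, na.rm = TRUE))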
print(myDF.mean)
Phylum Filename meanM8
1 Actinobacteria 4461011.3 70.11000
2 Actinobacteria 4461140.3 64.11000
3 Actinobacteria 4461152.3 54.33000
4 Actinobacteria 4461158.3 68.06000
5 Actinobacteria 4461210.3 58.73000
6 Crenarchaeota 4440041.3 60.53000
7 Euryarchaeota 4461229.3 45.56000
8 Euryarchaeota 4477874.3 58.00000
9 Firmicutes 4440368.3 21.49000
10 Firmicutes 4447102.3 37.27000
11 Firmicutes 4461168.3 58.95000
12 Firmicutes 4461186.3 38.81000
13 Firmicutes 4461211.3 44.59000
14 Proteobacteria 4440059.3 44.34000
15 Proteobacteria 4440319.3 58.59000
16 Proteobacteria 4440419.3 50.96000
17 Proteobacteria 4461199.3 58.00000
18 Proteobacteria 4477874.3 58.33333
Everything looks good. (The exercise is trivial for this sample, except for file 4477874.3, which has 4 entries, 3 of them Proteobacteria.) Next I try to calculate the relative abundance in the same call:
myDF.RA <- ddply(myDF, .(Phylum, Filename), summarize, meanM8 = mean(M8), RA = sum(length(Phylum))/sum(length(Filename)))
print(myDF.RA)
Phylum Filename meanM8 RA
1 Actinobacteria 4461011.3 70.11000 1
2 Actinobacteria 4461140.3 64.11000 1
3 Actinobacteria 4461152.3 54.33000 1
4 Actinobacteria 4461158.3 68.06000 1
5 Actinobacteria 4461210.3 58.73000 1
6 Crenarchaeota 4440041.3 60.53000 1
7 Euryarchaeota 4461229.3 45.56000 1
8 Euryarchaeota 4477874.3 58.00000 1
9 Firmicutes 4440368.3 21.49000 1
10 Firmicutes 4447102.3 37.27000 1
11 Firmicutes 4461168.3 58.95000 1
12 Firmicutes 4461186.3 38.81000 1
13 Firmicutes 4461211.3 44.59000 1
14 Proteobacteria 4440059.3 44.34000 1
15 Proteobacteria 4440319.3 58.59000 1
16 Proteobacteria 4440419.3 50.96000 1
17 Proteobacteria 4461199.3 58.00000 1
18 Proteobacteria 4477874.3 58.33333 1
For Proteobacteria associated with file 4477874.3, the RA should be 3/4 = 0.75, not 1.
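To make the expected value concrete, here is a minimal sketch that counts K and Z by hand for that one pair. It assumes only the four sample rows for file 4477874.3, copied into an illustrative data frame I call `sub` (that name is not part of my real script).

```r
# The four sample rows for file 4477874.3, copied from the sample data.
# `sub` is an illustrative name, not part of the original script.
sub <- data.frame(
  Phylum   = c("Euryarchaeota", "Proteobacteria", "Proteobacteria", "Proteobacteria"),
  M8       = c(58.0, 62.0, 57.0, 56.0),
  Filename = "4477874.3",
  stringsAsFactors = FALSE
)

# Z: number of entries for the file
Z <- sum(sub$Filename == "4477874.3")

# K: number of those entries that belong to Proteobacteria
K <- sum(sub$Filename == "4477874.3" & sub$Phylum == "Proteobacteria")

# Relative abundance = K / Z
RA <- K / Z
print(RA)  # 0.75
```

This works for a single phylum/file pair, but I need the same ratio for every pair, computed alongside the mean.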
How can I properly calculate the relative abundance? Thank you.