1

I have a H.U.G.E. dataset. Each row will contain a phylum name, one measurement (M8) and a file name. There are 10 unique bacteria names and 170 unique file (names).

The goal is to calculate the relative abundance and mean M8 of each phylum for each file. I know I can find the mean of M8...But I cannot quite figure out how to calculate the relative abundance at the same time. To be clear, to find the relative abundance for Actinobacteria in file x,

Z = number of times there is an entry for file x in the dataset
K = number of times there is an entry for *Actinobacteria* associated with file x

Relative abundance = K/Z.

I created small dataset by randomly selecting 20 rows.

Phylum M8 Filename
Crenarchaeota 60.53 4440041.3
Proteobacteria 44.34 4440059.3
Proteobacteria 58.59 4440319.3
Firmicutes 21.49 4440368.3
Proteobacteria 50.96 4440419.3
Firmicutes 37.27 4447102.3
Actinobacteria 70.11 4461011.3
Actinobacteria 64.11 4461140.3
Actinobacteria 54.33 4461152.3
Actinobacteria 68.06 4461158.3
Firmicutes 58.95 4461168.3
Firmicutes 38.81 4461186.3
Proteobacteria 58.0 4461199.3
Actinobacteria 58.73 4461210.3
Firmicutes 44.59 4461211.3
Euryarchaeota 45.56 4461229.3
Euryarchaeota 58.0 4477874.3
Proteobacteria 62.0 4477874.3
Proteobacteria 57.0 4477874.3
Proteobacteria 56.0 4477874.3

I find the mean for M8 by Filename

library('plyr')
myDF = read.csv(fileName, header = TRUE, sep = ' ')
myDF$Filename <- as.character(myDF$Filename)

myDF.mean = ddply(myDF, .(Filename), summarize, meanM8= mean(M8, na.rm=TRUE))
print(myDF.mean)

           Phylum  Filename   meanM8
1  Actinobacteria 4461011.3 70.11000
2  Actinobacteria 4461140.3 64.11000
3  Actinobacteria 4461152.3 54.33000
4  Actinobacteria 4461158.3 68.06000
5  Actinobacteria 4461210.3 58.73000
6   Crenarchaeota 4440041.3 60.53000
7   Euryarchaeota 4461229.3 45.56000
8   Euryarchaeota 4477874.3 58.00000
9      Firmicutes 4440368.3 21.49000
10     Firmicutes 4447102.3 37.27000
11     Firmicutes 4461168.3 58.95000
12     Firmicutes 4461186.3 38.81000
13     Firmicutes 4461211.3 44.59000
14 Proteobacteria 4440059.3 44.34000
15 Proteobacteria 4440319.3 58.59000
16 Proteobacteria 4440419.3 50.96000
17 Proteobacteria 4461199.3 58.00000
18 Proteobacteria 4477874.3 58.33333

Everything looks good...(this exercise is trivial for this dataset with the exception of Proteobacteria for file 4477874.3 - which has 3 entries (4 entries for 4477874.3)).

myDF.RA= ddply(myDF, .(Phylum, Filename), summarize, meanM8=mean(m8), RA = sum(length(Phylum))/sum(length(Filename)))
print(myDF.RA)


          Phylum  Filename   meanM8 RA
1  Actinobacteria 4461011.3 70.11000  1
2  Actinobacteria 4461140.3 64.11000  1
3  Actinobacteria 4461152.3 54.33000  1
4  Actinobacteria 4461158.3 68.06000  1
5  Actinobacteria 4461210.3 58.73000  1
6   Crenarchaeota 4440041.3 60.53000  1
7   Euryarchaeota 4461229.3 45.56000  1
8   Euryarchaeota 4477874.3 58.00000  1
9      Firmicutes 4440368.3 21.49000  1
10     Firmicutes 4447102.3 37.27000  1
11     Firmicutes 4461168.3 58.95000  1
12     Firmicutes 4461186.3 38.81000  1
13     Firmicutes 4461211.3 44.59000  1
14 Proteobacteria 4440059.3 44.34000  1
15 Proteobacteria 4440319.3 58.59000  1
16 Proteobacteria 4440419.3 50.96000  1
17 Proteobacteria 4461199.3 58.00000  1
18 Proteobacteria 4477874.3 58.33333  1

For Proteobacteria associated with file 4477874.3, the RA should be 3/4 = .75

How can I properly calculate the relative abundance? Thank you.

Roman Luštrik
  • 69,533
  • 24
  • 154
  • 197
cer
  • 1,961
  • 2
  • 17
  • 26

2 Answers2

2

I don't think you can do this in one step, since you need the total number of entries for each file to calculate the relative abundance. In two steps:

library(plyr)
df.file.count <- ddply(df, .(Filename), summarize, file.count=length(Filename))
df.phyl.file <- ddply(df, .(Filename, Phylum), summarize, meanM8=mean(M8), f.ph.count=length(Phylum))
transform(merge(df.file.count, df.phyl.file), RA=f.ph.count/file.count)[c(1, 3, 4, 6)]

# Filename         Phylum   meanM8   RA
# 1   4440041  Crenarchaeota 60.53000 1.00
# 2   4440059 Proteobacteria 44.34000 1.00
# 3   4440319 Proteobacteria 58.59000 1.00
# 4   4440368     Firmicutes 21.49000 1.00
# 5   4440419 Proteobacteria 50.96000 1.00
# 6   4447102     Firmicutes 37.27000 1.00
# 7   4461011 Actinobacteria 70.11000 1.00
# 8   4461140 Actinobacteria 64.11000 1.00
# 9   4461152 Actinobacteria 54.33000 1.00
# 10  4461158 Actinobacteria 68.06000 1.00
# 11  4461168     Firmicutes 58.95000 1.00
# 12  4461186     Firmicutes 38.81000 1.00
# 13  4461199 Proteobacteria 58.00000 1.00
# 14  4461210 Actinobacteria 58.73000 1.00
# 15  4461211     Firmicutes 44.59000 1.00
# 16  4461229  Euryarchaeota 45.56000 1.00
# 17  4477874  Euryarchaeota 58.00000 0.25
# 18  4477874 Proteobacteria 58.33333 0.75

Note I get different metrics. Perhaps I'm missinterpreting your RA calculation. For file 4477874 (I lost the .3s, but they are all there so it doesn't seem to matter), there are a total of 4 entries in the data set (3 for Proteo, 1 for Euryar...), so I calculate RA as 3/4 for 4477874-Proteo. Is this wrong?

As for methodology, first get file counts, then get file/bacteria counts, then merge the two together to calculate bacteria count / file count.

BrodieG
  • 51,669
  • 9
  • 93
  • 146
  • This looks pretty good...question; what is the role of [c(1, 3, 4, 6)]? – cer Jan 14 '14 at 01:44
  • @cer, merge will produce a `data.frame` that contains all the non-common columns from the input `data.frame`s. The last subset you highlight just keeps the columns that you were interested in. You can run the statement without the subset to see what I mean. – BrodieG Jan 14 '14 at 01:50
  • Gotcha. Thank you @BrodieG, you nailed it. – cer Jan 14 '14 at 01:52
  • 3
    Also @cer, I should mention, if `ddply` is running too slowly for your huge data set, you should strongly consider `data.table` as it will likely be much faster (like 10-100x faster). – BrodieG Jan 14 '14 at 01:52
  • There are ~ 35 million rows of data so yes, it does run slowly. Thank you @BrodieG for answering my question and thank you for the advice, 'tis appreciated. You have been ubber helpful. – cer Jan 14 '14 at 02:01
  • Shouldn't this ```print(myDF.mean)``` be of length 17 instead of 18? – marbel Jan 14 '14 at 03:48
  • 1
    @MartínBel, you're right, I think OP didn't update the output after changing the call to `ddply` in their code. Also, thanks for the `data.table` version. – BrodieG Jan 14 '14 at 14:29
2

Here is a data.table solution following Broadie's approach, i'm sure it can be done in fewer steps. Please feel free to edit.

require(data.table)
DT <- data.table(df)
DT[, Filename := as.factor(Filename)]

setkey(DT, Filename)
CountF <- DT[J(levels(Filename)), .N]
setkey(DT, Filename, Phylum)

DT_CJ <- DT[CJ(unique(Filename), unique(Phylum)), .N][N > 0]

setkey(DT_CJ, Filename)
JN <- DT_CJ[J(CountF)]
JN[, RA := N/N.1]

M8 <- DT[, list(meanM8 = mean(M8)), by="Filename,Phylum"]
setkey(JN, Filename, Phylum)
TBL <- JN[J(M8)]

tail(TBL)

#    Filename         Phylum N N.1   RA   meanM8
# 1: 4461199.3 Proteobacteria 1   1 1.00 58.00000
# 2: 4461210.3 Actinobacteria 1   1 1.00 58.73000
# 3: 4461211.3     Firmicutes 1   1 1.00 44.59000
# 4: 4461229.3  Euryarchaeota 1   1 1.00 45.56000
# 5: 4477874.3  Euryarchaeota 1   4 0.25 58.00000
# 6: 4477874.3 Proteobacteria 3   4 0.75 58.33333
marbel
  • 7,560
  • 6
  • 49
  • 68
  • Bel: Could you elaborate a bit on the `CJ` part of the code? What does it stand for, what does it do? Same with RA. Thanks! – Peter Lustig Jan 15 '14 at 19:02
  • 1
    @PeterLustig CJ means Cross Join. This question explains what it is. http://stackoverflow.com/questions/10600060/how-to-do-cross-join-in-r – marbel Jan 15 '14 at 20:25