I'm hoping to get some advice as to whether I'm working in the right direction as I try to normalize some data.
In my dataset I have three factor columns ("Cell types", "Timepoint", and "Treatment") and a numerical column representing the effectiveness of a treatment.
I would like to normalize my numerical column by the Untreated value within a given timepoint for a given cell type. For example (forgive me, I'm not sure how best to include sample data on this forum yet...)
So ideally, I would normalize treatments A, B, and C relative to Untreated and get the following:
In MATLAB I would have done this with loops, but I'm hoping I might be able to use ddply in R. From what I've learned about ddply, I should be able to split my data frame by my factor columns, but I'm not sure how to write a function that identifies the right normalization factor in each subgroup.
Any suggestions would be greatly appreciated.
EDIT:
Thank you to Alistaire for teaching me how to post data correctly. Here's the dput output for part of my dataset:
dput(df2)
structure(list(Cell.Line = structure(c(3L, 3L, 3L, 3L, 3L, 3L,
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L,
3L, 3L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L), .Label = c("CAOV3",
"COV318", "COV362", "FUOV1", "FUOV2", "JHOS2", "JHOS4"), class = "factor"),
Median.RFU = c(99307, 13684, 207127, 294911, 2480000, 2510000,
927000, 1050000, 84074, 96132, 294911, 129310, 54001, 10595,
55558, 60015, 242676, 133580, 88273, 116825, 46846, 49991,
54442, 48590, 275237, 112631, 125685, 361313, 7330000, 139117,
4720000, 2640000, 13193, 154611, 2230000, 54001, 83733, 69464,
54663, 54886, 384009, 54663, 721000, 1100000, 13574, 51852,
307136, 54663, 53131, 55558), Timepoint = structure(c(1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 1L, 1L), .Label = c("24 hours", "4 hours"), class = "factor"),
Treatment = structure(c(3L, 12L, 11L, 10L, 9L, 8L, 7L, 6L,
5L, 4L, 2L, 1L, 3L, 12L, 11L, 10L, 9L, 8L, 7L, 6L, 5L, 4L,
2L, 1L, 1L, 2L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 3L,
1L, 2L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 3L, 1L, 2L
), .Label = c("Dextran Sulfate", "Fucoidan", "gCML", "Heparin Folate",
"Heparin Sulfate", "Hyaluronic Acid", "Poly (acrylic) Acid",
"Poly-L-Aspartic Acid", "Poly-L-Glutamic Acid", "Poly-L-Glutamic Acid-b-Polyethylene glycol",
"Sulfated B-Cyclodextrin Polymer", "Untreated"), class = "factor")), class = "data.frame", row.names = c(NA,
-50L), .Names = c("Cell.Line", "Median.RFU", "Timepoint", "Treatment"
))
Also, thanks a bunch for pointing me towards dplyr; it has helped tremendously! I think I'm pretty close to finding a solution, but I'm struggling to correctly identify the normalization factor. Since the untreated condition almost always has the lowest response (variable Median.RFU), I can very nearly obtain the desired result with the following code:
a1 <- group_by(df2, Cell.Line, Timepoint)
a2 <- mutate(a1, Normalized = Median.RFU / min(Median.RFU))
However, this doesn't always hold, as you can see for the COV318 cell type at 24 hours, where the untreated value is not the minimum.
Now I'm struggling to identify the value in the Median.RFU column that corresponds to the "Untreated" rows of the Treatment column. I tried to identify it as
a1[a1$Treatment=="Untreated",2]
but it looks like it is dividing each cell by the entire Untreated subset and generating a data.frame within a data.frame. Any advice on how to move forward?
EDIT2:
I've got code that works now, though I'm sure there's a more elegant solution. What I ended up doing was using the arrange() function to order my data.frame so that the untreated row was the last element of each group_by() subset.
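# Sorting by Treatment puts "Untreated" last in each group,
# because it is the last level of the Treatment factor
df2 <- arrange(df, Cell.Line, Timepoint, Treatment)
a1 <- group_by(df2, Cell.Line, Timepoint)
a2 <- mutate(a1, Norm.MFI = Median.RFU / last(Median.RFU))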
It's incredible, though, that I can do this in just a few lines of code instead of the loops I had traditionally relied on in MATLAB! Thank you all for your help, and I would still welcome advice if you think there's a better way of doing this.
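For reference, the ordering trick depends on "Untreated" sorting last and on every group actually containing an untreated row (in the posted subset, the CAOV3 group has none, so last() would silently pick a different treatment there). One possible alternative, sketched here rather than offered as the definitive approach, is to index Median.RFU by the Treatment label inside mutate(). The `toy` data frame and its values below are made up purely for illustration; only the column names follow df2.

```r
library(dplyr)

# Hypothetical toy data with the same column names as df2 (values are invented)
toy <- data.frame(
  Cell.Line  = c("X", "X", "X", "Y", "Y", "Y"),
  Timepoint  = "4 hours",
  Treatment  = c("A", "Untreated", "B", "Untreated", "A", "B"),
  Median.RFU = c(200, 100, 50, 10, 40, 5)
)

normalized <- toy %>%
  group_by(Cell.Line, Timepoint) %>%
  # Within each group, divide by that group's Untreated value.
  # [1] takes the first match if there are several and yields NA
  # (rather than an error) when a group has no Untreated row.
  mutate(Norm.MFI = Median.RFU / Median.RFU[Treatment == "Untreated"][1]) %>%
  ungroup()

print(normalized)
```

Because the lookup is by label rather than by position, the row order no longer matters, and groups without an untreated row come out as NA instead of being normalized to the wrong treatment.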