Normalizing based on three layers of factor categorization

Question

I'm hoping to get some advice as to whether I'm working in the right direction as I try to normalize some data.

In my dataset I have three factor columns ("Cell types", "Timepoint", and "Treatment") and a numerical column representing the effectiveness of a treatment.

I would like to normalize my numerical column by the Untreated subset in a given timepoint for a given cell type. For example (forgive me, I'm not yet sure how best to include sample data on this forum):

Raw data table

So ideally, I would normalize treatments A, B, and C relative to Untreated and get the following:

Normalized data

In MATLAB I would have done this with loops, but I'm hoping I can use ddply in R. From what I've learned about ddply, I should be able to split my data frame by my factor columns, but I'm not sure how to write a function that identifies the right normalization factor in each subgroup.
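One way to think about the per-subgroup function is sketched below in base R, so it needs no packages. The data frame `toy` and its values are made up purely for illustration, and `normalize_group` is a hypothetical helper name. The sketch assumes each cell line and timepoint combination has exactly one Untreated row.

```r
# Hypothetical toy data; the values are made up for illustration only.
toy <- data.frame(
  Cell.Line  = rep(c("X", "Y"), each = 4),
  Timepoint  = rep(c("4 hours", "24 hours"), each = 2, times = 2),
  Treatment  = rep(c("Untreated", "A"), times = 4),
  Median.RFU = c(10, 20, 5, 15, 100, 50, 8, 8),
  stringsAsFactors = TRUE
)

# Divide every value in one subgroup by that subgroup's Untreated value.
normalize_group <- function(sub) {
  baseline <- sub$Median.RFU[sub$Treatment == "Untreated"]
  # Assumption: exactly one Untreated row per subgroup.
  stopifnot(length(baseline) == 1)
  sub$Normalized <- sub$Median.RFU / baseline
  sub
}

# Split by cell line and timepoint, normalize each piece, then recombine.
pieces <- split(toy, list(toy$Cell.Line, toy$Timepoint), drop = TRUE)
toy_norm <- do.call(rbind, lapply(pieces, normalize_group))
rownames(toy_norm) <- NULL
```

The same helper could probably be handed to ddply as well, along the lines of `ddply(toy, c("Cell.Line", "Timepoint"), normalize_group)`, since ddply applies a function to each subset of a data frame and recombines the results.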

Any suggestions would be greatly appreciated.

EDIT:

Thank you to Alistaire for teaching me how to post data correctly. Here's the dput output for part of my dataset:

dput(df2)
structure(list(Cell.Line = structure(c(3L, 3L, 3L, 3L, 3L, 3L, 
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 
3L, 3L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L), .Label = c("CAOV3", 
"COV318", "COV362", "FUOV1", "FUOV2", "JHOS2", "JHOS4"), class = "factor"), 
    Median.RFU = c(99307, 13684, 207127, 294911, 2480000, 2510000, 
    927000, 1050000, 84074, 96132, 294911, 129310, 54001, 10595, 
    55558, 60015, 242676, 133580, 88273, 116825, 46846, 49991, 
    54442, 48590, 275237, 112631, 125685, 361313, 7330000, 139117, 
    4720000, 2640000, 13193, 154611, 2230000, 54001, 83733, 69464, 
    54663, 54886, 384009, 54663, 721000, 1100000, 13574, 51852, 
    307136, 54663, 53131, 55558), Timepoint = structure(c(1L, 
    1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 
    2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
    1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
    2L, 2L, 1L, 1L), .Label = c("24 hours", "4 hours"), class = "factor"), 
    Treatment = structure(c(3L, 12L, 11L, 10L, 9L, 8L, 7L, 6L, 
    5L, 4L, 2L, 1L, 3L, 12L, 11L, 10L, 9L, 8L, 7L, 6L, 5L, 4L, 
    2L, 1L, 1L, 2L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 3L, 
    1L, 2L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 3L, 1L, 2L
    ), .Label = c("Dextran Sulfate", "Fucoidan", "gCML", "Heparin Folate", 
    "Heparin Sulfate", "Hyaluronic Acid", "Poly (acrylic) Acid", 
    "Poly-L-Aspartic Acid", "Poly-L-Glutamic Acid", "Poly-L-Glutamic Acid-b-Polyethylene glycol", 
    "Sulfated B-Cyclodextrin Polymer", "Untreated"), class = "factor")), class = "data.frame", row.names = c(NA, 
-50L), .Names = c("Cell.Line", "Median.RFU", "Timepoint", "Treatment"
))

Also, thanks a bunch for pointing me towards dplyr; it has helped tremendously! I think I'm pretty close to finding a solution, but I'm struggling to correctly identify the normalization factor. Since the untreated condition almost always has the lowest response (variable Median.RFU), I can very nearly obtain the desired processing with the following code:

library(dplyr)

# Group by cell line and timepoint, then divide by the group minimum
a1 <- group_by(df2, Cell.Line, Timepoint)
a2 <- mutate(a1, Normalized = Median.RFU / min(Median.RFU))

However, this doesn't always hold, as you can see for the COV318 cell line, where the untreated value is not the minimum.

Now I'm struggling to identify the value in the Median.RFU column that corresponds to the "Untreated" rows of the Treatment column. I tried to identify it as

a1[a1$Treatment=="Untreated",2]

but that appears to divide each value by the entire Untreated subset (across all groups) and produces a data.frame nested inside a data.frame. Any advice on how to move forward?
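For what it's worth, a minimal sketch of picking out the Untreated value inside each group is below. The key difference from `a1[a1$Treatment=="Untreated",2]` is that the subsetting happens inside `mutate()` on the bare column names, so it is evaluated once per group rather than on the whole table. The toy data and its values are hypothetical, and the sketch assumes exactly one Untreated row per group. In the partial data posted above, the CAOV3 group has no Untreated row, so that group would need handling (or filtering out) first.

```r
library(dplyr)

# Hypothetical toy data; values are made up. Note that in cell line X
# the Untreated value (10) is not the group minimum (5).
toy <- data.frame(
  Cell.Line  = rep(c("X", "Y"), each = 3),
  Timepoint  = "4 hours",
  Treatment  = rep(c("A", "Untreated", "B"), times = 2),
  Median.RFU = c(20, 10, 5, 9, 3, 6)
)

toy_norm <- toy %>%
  group_by(Cell.Line, Timepoint) %>%
  # Within each group, Treatment == "Untreated" selects that group's baseline.
  # Assumption: exactly one Untreated row per group.
  mutate(Normalized = Median.RFU / Median.RFU[Treatment == "Untreated"]) %>%
  ungroup()
```

Because the baseline is looked up by label rather than by position or by minimum, the result should not depend on row order or on Untreated having the lowest response.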

EDIT2:

I've got code that works now, though I'm sure there's a more elegant solution. I ended up using the arrange() function to order my data.frame so that the untreated row is the last element of each group_by() subset. This works because "Untreated" is the last level of the Treatment factor.

# Sort so that Untreated (the last factor level) ends up last in each group
df2 <- arrange(df, Cell.Line, Timepoint, Treatment)
a1 <- group_by(df2, Cell.Line, Timepoint)
a2 <- mutate(a1, Norm.MFI = Median.RFU / last(Median.RFU))

It's incredible, though, that I can do this in just a few lines of code rather than the loops I had traditionally relied on in MATLAB! Thank you all for your help, and I would still welcome advice if you think there's a better way of doing this.
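One possible alternative that does not rely on sort order is to pull the Untreated rows into their own lookup table and join it back on. This is only a sketch under stated assumptions: the toy data and values are hypothetical, `baseline` and `Baseline` are illustrative names, and it assumes at most one Untreated row per cell line and timepoint (more than one would duplicate rows in the join). A group without an Untreated row ends up with NA instead of being silently divided by whichever treatment happens to sort last.

```r
library(dplyr)

# Hypothetical toy data; cell line Y deliberately has no Untreated row.
toy <- data.frame(
  Cell.Line  = c("X", "X", "X", "Y", "Y"),
  Timepoint  = c("4 hours", "4 hours", "4 hours", "4 hours", "4 hours"),
  Treatment  = c("A", "Untreated", "B", "A", "B"),
  Median.RFU = c(20, 10, 30, 7, 9)
)

# One baseline row per cell line and timepoint.
baseline <- toy %>%
  filter(Treatment == "Untreated") %>%
  select(Cell.Line, Timepoint, Baseline = Median.RFU)

# Attach the baseline to every row of its group, then normalize.
toy_norm <- toy %>%
  left_join(baseline, by = c("Cell.Line", "Timepoint")) %>%
  mutate(Normalized = Median.RFU / Baseline)
```

Keeping the `Baseline` column in the result also makes it easy to check by eye which value each row was divided by.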

The best way to post data is to post the results of `dput(df)`, which other users can copy and run to get the exact same data. Pictures are bad, because getting data out of them is...well, not impossible, but really hard. [Here's some useful reading on how to write a great R question that might be helpful.](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example/5963610#5963610) — alistaire, May 08 '16 at 05:26
Also, `dplyr` is the successor to `plyr`; if you're just starting, just learn `dplyr` (which is more intuitive, anyway). — alistaire, May 08 '16 at 05:27
Thank you for your help! I've updated my post with some real data and the progress I've made using dplyr. — Sancor, May 08 '16 at 16:07
