I am new to Pandas so please forgive me inexperience. Nonetheless I have worked on a lot of the parts of my question here.
For simplicity let's take the example from the wiki article on Quantile Normalization:
A 5 4 3
B 2 1 4
C 3 4 6
D 4 2 8
and update it to fit the data structure that I am dealing with:
df = pd.DataFrame({
'gene': ['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'c', 'd', 'd', 'd', 'e', 'e', 'e', 'f', 'f', 'f'],
'rep': [1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3],
'val': [5, 4, 3, 2, 1, 4, 3, 4, 6, 4, 2, 8, 0, 1, 0, 0, 2, 4],
'subset':['y', 'y', 'y', 'y', 'y', 'y', 'y', 'y', 'y', 'y', 'y', 'y', 'n', 'n', 'n', 'n', 'n', 'n']
})
gene rep subset val
0 a 1 y 5
1 a 2 y 4
2 a 3 y 3
3 b 1 y 2
4 b 2 y 1
5 b 3 y 4
6 c 1 y 3
7 c 2 y 4
8 c 3 y 6
9 d 1 y 4
10 d 2 y 2
11 d 3 y 8
12 e 1 n 0
13 e 2 n 1
14 e 3 n 0
15 f 1 n 0
16 f 2 n 2
17 f 3 n 4
This flattened structure might seem odd and inefficient (at the very least redundant), but for my particular use case - it is the best option - so please bare with it.
In this example we want to run quantile normalization on the original data (genes a - d), so we grab the subset (take a subset on some meta datakey):
sub = df[df.subset == 'y']
The shape is still off so using the pivot
function as I recently learned from @Wan from my GroupBy question:
piv = sub.pivot(index='gene', columns='rep', values='val')
rep 1 2 3
gene
a 5 4 3
b 2 1 4
c 3 4 6
d 4 2 8
This results in the lost of the other columns which may or may not be relevant for later. Carrying on, using my quantile normalization function that can handle mixed dataframes:
quantile_normalize(piv, [1, 2, 3])
rep 1 2 3
gene
a 5.666667 4.666667 2.000000
b 2.000000 2.000000 3.000000
c 3.000000 4.666667 4.666667
d 4.666667 3.000000 5.666667
which is the expected result from the wiki:
A 5.67 4.67 2.00
B 2.00 2.00 3.00
C 3.00 4.67 4.67
D 4.67 3.00 5.67
neat.
Now my question:
How do I take these values and plug them back into the original data frame?