1

I am new to Pandas so please forgive me inexperience. Nonetheless I have worked on a lot of the parts of my question here.

For simplicity let's take the example from the wiki article on Quantile Normalization:

A    5    4    3
B    2    1    4
C    3    4    6
D    4    2    8

and update it to fit the data structure that I am dealing with:

df = pd.DataFrame({
        'gene': ['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'c', 'd', 'd', 'd', 'e', 'e', 'e', 'f', 'f', 'f'],
        'rep': [1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3],
        'val': [5, 4, 3, 2, 1, 4, 3, 4, 6, 4, 2, 8, 0, 1, 0, 0, 2, 4],
        'subset':['y', 'y', 'y', 'y', 'y', 'y', 'y', 'y', 'y', 'y', 'y', 'y', 'n', 'n', 'n', 'n', 'n', 'n'] 
})


    gene rep subset val
0   a   1   y   5
1   a   2   y   4
2   a   3   y   3
3   b   1   y   2
4   b   2   y   1
5   b   3   y   4
6   c   1   y   3
7   c   2   y   4
8   c   3   y   6
9   d   1   y   4
10  d   2   y   2
11  d   3   y   8
12  e   1   n   0
13  e   2   n   1
14  e   3   n   0
15  f   1   n   0
16  f   2   n   2
17  f   3   n   4

This flattened structure might seem odd and inefficient (at the very least redundant), but for my particular use case - it is the best option - so please bare with it.

In this example we want to run quantile normalization on the original data (genes a - d), so we grab the subset (take a subset on some meta datakey):

sub = df[df.subset == 'y']

The shape is still off so using the pivot function as I recently learned from @Wan from my GroupBy question:

piv = sub.pivot(index='gene', columns='rep', values='val')

rep 1   2   3
gene            
a   5   4   3
b   2   1   4
c   3   4   6
d   4   2   8

This results in the lost of the other columns which may or may not be relevant for later. Carrying on, using my quantile normalization function that can handle mixed dataframes:

quantile_normalize(piv, [1, 2, 3])

rep     1   2   3
gene            
a   5.666667    4.666667    2.000000
b   2.000000    2.000000    3.000000
c   3.000000    4.666667    4.666667
d   4.666667    3.000000    5.666667

which is the expected result from the wiki:

A    5.67    4.67    2.00
B    2.00    2.00    3.00
C    3.00    4.67    4.67
D    4.67    3.00    5.67

neat.

Now my question:

How do I take these values and plug them back into the original data frame?

SumNeuron
  • 4,850
  • 5
  • 39
  • 107

1 Answers1

1

You can merge your result back to the original dataframe after melt-ing your result dataframe:

result = quantile_normalize(piv, [1, 2, 3])
result = result.reset_index().melt(id_vars='gene', value_name='quantile')
result
>>>   gene rep  quantile
0     a   1         5.666667
1     b   1         2.000000
2     c   1         3.000000
3     d   1         4.666667
4     a   2         4.666667
5     b   2         2.000000
6     c   2         4.666667
7     d   2         3.000000
8     a   3         2.000000
9     b   3         3.000000
10    c   3         4.666667
11    d   3         5.666667

df = pd.merge(df, result, on=['gene', 'rep'], how='outer')
df 
>>>    gene  rep subset  val  quantile
0     a    1      y    5  5.666667
1     a    2      y    4  4.666667
2     a    3      y    3  2.000000
3     b    1      y    2  2.000000
4     b    2      y    1  2.000000
5     b    3      y    4  3.000000
6     c    1      y    3  3.000000
7     c    2      y    4  4.666667
8     c    3      y    6  4.666667
9     d    1      y    4  4.666667
10    d    2      y    2  3.000000
11    d    3      y    8  5.666667
12    e    1      n    0       NaN
13    e    2      n    1       NaN
14    e    3      n    0       NaN
15    f    1      n    0       NaN
16    f    2      n    2       NaN
17    f    3      n    4       NaN
Prikers
  • 858
  • 1
  • 9
  • 24