
Simply put, how do I apply quantile normalization to a large Pandas dataframe (roughly 2,000,000 rows) in Python?

PS. I know there is a package named rpy2 that can run R in a subprocess, so quantile normalization could be done in R. But the truth is that R does not compute the correct result when I use the data set below:

5.690386092696389541e-05,2.051450375415418849e-05,1.963190184049079707e-05,1.258362869906251862e-04,1.503352476021528139e-04,6.881341586355676286e-06
8.535579139044583634e-05,5.128625938538547123e-06,1.635991820040899643e-05,6.291814349531259308e-05,3.006704952043056075e-05,6.881341586355676286e-06
5.690386092696389541e-05,2.051450375415418849e-05,1.963190184049079707e-05,1.258362869906251862e-04,1.503352476021528139e-04,6.881341586355676286e-06
2.845193046348194770e-05,1.538587781561563968e-05,2.944785276073619561e-05,4.194542899687506431e-05,6.013409904086112150e-05,1.032201237953351358e-05

Edit:

What I want:

Given the data shown above, how do I apply quantile normalization following the steps in https://en.wikipedia.org/wiki/Quantile_normalization?

I found a piece of Python code claiming to compute quantile normalization:

import rpy2.robjects as robjects
import numpy as np
from rpy2.robjects.packages import importr
preprocessCore = importr('preprocessCore')


# Flatten the nested lists and rebuild them as an R matrix, one inner list per column
matrix = [ [1,2,3,4,5], [1,3,5,7,9], [2,4,6,8,10] ]
v = robjects.FloatVector([ element for col in matrix for element in col ])
m = robjects.r['matrix'](v, ncol = len(matrix), byrow=False)
# Quantile-normalize with preprocessCore and convert the result back to a NumPy array
Rnormalized_matrix = preprocessCore.normalize_quantiles(m)
normalized_matrix = np.array( Rnormalized_matrix)

The code works fine with the sample data used in it, but when I tested it with the data given above the result came out wrong.

Since rpy2 provides an interface that runs R as a subprocess of Python, I tested it again in R directly and the result was still wrong. I therefore think the reason is that the method in R is wrong.

Shawn. L
  • I removed the "R" tag since you (1) aren't using R and (2) don't want R in the answer. But if you say "R cannot compute the correct result", it sounds like you are either disparaging R (to what end?) or want somebody to correct your unposted code. Either way, perhaps I'm misunderstanding what you want: quantile normalization needs a source and target distribution and I'm not certain which you're providing here. Can you clarify, please? – r2evans Jun 21 '16 at 05:34
  • @r2evans Thanks for your comment; I have already edited the question. FYI, the code I googled runs R as a subprocess of Python. After running R directly I found that the result was wrong. Besides, I'm not sure what you mean by 'target distribution'. According to the wiki, the computation of quantile normalization doesn't involve that term. The question, which I hope I have made clear, is how to apply quantile normalization to the data I gave. – Shawn. L Jun 21 '16 at 06:10
  • You are right, my term of "target" isn't really good. The wiki references *"making two distributions identical"*, so I was wondering what your two distributions were. Now that you provided additional code (and data, defined as `matrix`), I'm confused about which is your actual data to be quant-normed. (Perhaps a stupid question, but is it possible that the matrix is transposed compared with what you actually need?) – r2evans Jun 21 '16 at 06:22
  • @r2evans I'm sorry for the confusion I caused. FYI, the actual data is a (2119055, 124) matrix; the data I gave above is a tiny subset of it for testing. And yes, I did consider the question of transposing. As you can see, in the sample code the matrix is (3, 5) but the normalized result is (5, 3), so I concluded that to use this code I need to transpose the matrix first. To be clearer, my data is (4, 6), and to use the code I would assign the transposed data, i.e. (6, 4), to the variable `matrix` and then continue. – Shawn. L Jun 21 '16 at 07:06
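
To make the shape discussion above concrete, here is a minimal NumPy sketch of the Wikipedia procedure applied to the 4x6 sample from the question, with each column treated as a sample to be normalized (so no transpose is needed in this layout). The tie handling, where tied values simply receive distinct ranks, is an assumption for illustration rather than the behaviour of the rpy2 code above.

import numpy as np

# The 4x6 sample from the question: rows are features, columns are samples.
data = np.array([
    [5.690386092696389541e-05, 2.051450375415418849e-05, 1.963190184049079707e-05,
     1.258362869906251862e-04, 1.503352476021528139e-04, 6.881341586355676286e-06],
    [8.535579139044583634e-05, 5.128625938538547123e-06, 1.635991820040899643e-05,
     6.291814349531259308e-05, 3.006704952043056075e-05, 6.881341586355676286e-06],
    [5.690386092696389541e-05, 2.051450375415418849e-05, 1.963190184049079707e-05,
     1.258362869906251862e-04, 1.503352476021528139e-04, 6.881341586355676286e-06],
    [2.845193046348194770e-05, 1.538587781561563968e-05, 2.944785276073619561e-05,
     4.194542899687506431e-05, 6.013409904086112150e-05, 1.032201237953351358e-05],
])

# Step 1: sort each column, then average across each sorted row to get the
# mean value of every rank position.
rank_means = np.sort(data, axis=0).mean(axis=1)

# Step 2: within each column, replace every value by the mean of its rank.
# argsort-of-argsort yields 0-based ranks; tied values each get a distinct rank.
ranks = data.argsort(axis=0).argsort(axis=0)
print(rank_means[ranks])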

9 Answers


Using the example dataset from the Wikipedia article:

df = pd.DataFrame({'C1': {'A': 5, 'B': 2, 'C': 3, 'D': 4},
                   'C2': {'A': 4, 'B': 1, 'C': 4, 'D': 2},
                   'C3': {'A': 3, 'B': 4, 'C': 6, 'D': 8}})

df
Out: 
   C1  C2  C3
A   5   4   3
B   2   1   4
C   3   4   6
D   4   2   8

For each rank, the mean value can be calculated with the following:

rank_mean = df.stack().groupby(df.rank(method='first').stack().astype(int)).mean()

rank_mean
Out: 
1    2.000000
2    3.000000
3    4.666667
4    5.666667
dtype: float64

Then the resulting Series, rank_mean, can be used as a mapping for the ranks to get the normalized results:

df.rank(method='min').stack().astype(int).map(rank_mean).unstack()
Out: 
         C1        C2        C3
A  5.666667  4.666667  2.000000
B  2.000000  2.000000  3.000000
C  3.000000  4.666667  4.666667
D  4.666667  3.000000  5.666667
ayhan
    elegant use of `groupby`, `map`, and `stacking/unstacking`. are you a `pandas` developer? – O.rka Jan 09 '17 at 19:07
    Thanks. No, I am just a regular user. – ayhan Jan 09 '17 at 21:04
  • @ayhan Why did you use different ranking methods in the first and second processing lines, i.e. `first` vs `min`? – Mischa Lisovyi Aug 19 '19 at 14:59
    Just pointing out (and self-promoting) that this does not generate the "correct" results according to Wikipedia. I implemented a fast method that does produce the correct results and is installable with conda or pip: https://stackoverflow.com/a/62792272/9544516 – Maarten-vd-Sande Aug 02 '20 at 09:04
  • This looks great! The only thing needed to fix the values with the same rank is to use the average, and interpolate the rank_mean in between. I added the adjustment in my post. https://stackoverflow.com/a/67597273/1486196 – chase May 19 '21 at 05:15

OK, I implemented the method myself with relatively high efficiency.

In hindsight the logic seems fairly simple, but I decided to post it here anyway for anyone who feels as confused as I was when I couldn't google any available code.

The code is on GitHub: Quantile Normalize
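
For readers who do not want to follow the link, here is a rough, self-contained pandas sketch of the same general idea (sort each column, average across the sorted rows, then map every value back through its rank). This is only an illustration, not the code from the linked repository, and ties are given distinct ranks via method='first', which is just one possible convention.

import numpy as np
import pandas as pd

def quantile_normalize(df):
    """Quantile-normalize the columns of df (rows are features, columns are samples)."""
    # Mean of each rank position, computed from the column-wise sorted values.
    rank_means = pd.Series(np.sort(df.to_numpy(), axis=0).mean(axis=1),
                           index=np.arange(1, len(df) + 1))
    # Replace each value by the mean of its rank within its own column.
    return df.rank(method='first').stack().astype(int).map(rank_means).unstack()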

Shawn. L

One thing worth noting is that both ayhan's and Shawn's code use the smaller rank mean for ties, whereas the R package preprocessCore's normalize.quantiles() uses the mean of the rank means for ties.

Using the above example:

> df

   C1  C2  C3
A   5   4   3
B   2   1   4
C   3   4   6
D   4   2   8

> normalize.quantiles(as.matrix(df))

         C1        C2        C3
A  5.666667  5.166667  2.000000
B  2.000000  2.000000  3.000000
C  3.000000  5.166667  4.666667
D  4.666667  3.000000  5.666667
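
For comparison, here is a small pandas sketch, written for this example rather than taken from preprocessCore, that reproduces the tie-averaging behaviour by interpolating the rank means at the tied (half-integer) average ranks. The answer by chase further down takes the same interpolation approach.

import pandas as pd

df = pd.DataFrame({'C1': {'A': 5, 'B': 2, 'C': 3, 'D': 4},
                   'C2': {'A': 4, 'B': 1, 'C': 4, 'D': 2},
                   'C3': {'A': 3, 'B': 4, 'C': 6, 'D': 8}})

# Mean value of each rank position (ranks 1.0 .. 4.0, ties broken by method='first').
rank_mean = df.stack().groupby(df.rank(method='first').stack()).mean()

# method='average' assigns tied entries a half-integer rank (3.5 for the two 4s
# in C2); adding those ranks to the index and interpolating averages the two
# neighbouring rank means, reproducing the output above.
avg_rank = df.rank(method='average').stack()
rank_mean = rank_mean.reindex(rank_mean.index.union(avg_rank.unique())).sort_index().interpolate()
print(avg_rank.map(rank_mean).unstack())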
msg
  • Just here to say that I made a package/answer called qnorm for Python which does handle ties: https://stackoverflow.com/a/62792272/9544516 – Maarten-vd-Sande Jul 08 '20 at 10:04

The code below gives results identical to preprocessCore::normalize.quantiles.use.target, and I find it simpler and clearer than the solutions above. Performance should also stay good up to huge array lengths.

import numpy as np

def quantile_normalize_using_target(x, target):
    """
    Both `x` and `target` are numpy arrays of equal lengths.
    """

    target_sorted = np.sort(target)

    return target_sorted[x.argsort().argsort()]

Once you have a pandas.DataFrame, this is easy to do:

quantile_normalize_using_target(df[0].to_numpy(),
                                df[1].to_numpy())

(This normalizes the first column to the second one, used as the reference distribution, in the example above.)

deeenes

This is a minor adjustment, but I imagine that many have noticed the subtle 'flaw' in @ayhan's answer.

I have made a small adjustment to it, which gets the 'correct' answer, while not having to resort to any external libraries for such an exceedingly simple function.

The only adjustment necessary is the [Add interpolated values] section.

import pandas as pd

df = pd.DataFrame({'C1': {'A': 5, 'B': 2, 'C': 3, 'D': 4},
                   'C2': {'A': 4, 'B': 1, 'C': 4, 'D': 2},
                   'C3': {'A': 3, 'B': 4, 'C': 6, 'D': 8}})

def quant_norm(df):
    ranks = (df.rank(method="first")
              .stack())
    rank_mean = (df.stack()
                   .groupby(ranks)
                   .mean())
    # Add interpolated values in between ranks
    finer_ranks = ((rank_mean.index+0.5).to_list() +
                    rank_mean.index.to_list())
    rank_mean = rank_mean.reindex(finer_ranks).sort_index().interpolate()
    return (df.rank(method='average')
              .stack()
              .map(rank_mean)
              .unstack())
quant_norm(df)

Out[122]: 
         C1        C2        C3
A  5.666667  5.166667  2.000000
B  2.000000  2.000000  3.000000
C  3.000000  5.166667  4.666667
D  4.666667  3.000000  5.666667
chase

It is possibly more robust to use the median of each row rather than the mean (based on code from Shawn. L):

import numpy as np
import pandas as pd

def quantileNormalize(df_input):
    df = df_input.copy()
    #compute rank
    dic = {}
    for col in df:
        dic[col] = df[col].sort_values(na_position='first').values
    sorted_df = pd.DataFrame(dic)
    #rank = sorted_df.mean(axis = 1).tolist()
    rank = sorted_df.median(axis = 1).tolist()
    #sort
    for col in df:
        # compute percentile rank [0,1] for each score in column 
        t = df[col].rank( pct=True, method='max' ).values
        # replace percentile values in column with quantile normalized score
        # retrieve q_norm score using calling rank with percentile value
        df[col] = [ np.nanpercentile( rank, i*100 ) if ~np.isnan(i) else np.nan for i in t ]
    return df
xspensiv

I am new to pandas and late to the question, but I think this answer might also be of use. It builds on the great answer from @ayhan:

import pandas as pd

def quantile_normalize(dataframe, cols, pandas=pd):

    # copy dataframe and only use the columns with numerical values
    df = dataframe.copy().filter(items=cols)

    # columns from the original dataframe not specified in cols
    non_numeric = dataframe.filter(items=list(filter(lambda col: col not in cols, list(dataframe))))


    rank_mean = df.stack().groupby(df.rank(method='first').stack().astype(int)).mean()  

    norm = df.rank(method='min').stack().astype(int).map(rank_mean).unstack()


    result = pandas.concat([norm, non_numeric], axis=1)
    return result

The main difference here is that it is closer to some real-world applications. Often you just have matrices of numerical data, in which case the original answer is sufficient.

Sometimes you have text-based data in there as well. This lets you specify the columns cols containing your numerical data; quantile normalization is run on those columns, and at the end the non-numeric (or not-to-be-normalized) columns from your original data frame are merged back in.

e.g. if you added some 'meta-data' (char) to the wiki example:

df = pd.DataFrame({
    'rep1': [5, 2, 3, 4],
    'rep2': [4, 1, 4, 2],
    'rep3': [3, 4, 6, 8],
    'char': ['gene_a', 'gene_b', 'gene_c', 'gene_d']
}, index = ['a', 'b', 'c', 'd'])

you can then call

quantile_normalize(df, ['rep1', 'rep2', 'rep3'])

to get

    rep1        rep2        rep3        char
a   5.666667    4.666667    2.000000    gene_a
b   2.000000    2.000000    3.000000    gene_b
c   3.000000    4.666667    4.666667    gene_c
d   4.666667    3.000000    5.666667    gene_d
SumNeuron

As pointed out by @msg, none of the solutions here take ties into account. I made a Python package called qnorm which handles ties and correctly recreates the Wikipedia quantile normalization example:

import pandas as pd
import qnorm

df = pd.DataFrame({'C1': {'A': 5, 'B': 2, 'C': 3, 'D': 4},
                   'C2': {'A': 4, 'B': 1, 'C': 4, 'D': 2},
                   'C3': {'A': 3, 'B': 4, 'C': 6, 'D': 8}})

print(qnorm.quantile_normalize(df))
         C1        C2        C3
A  5.666667  5.166667  2.000000
B  2.000000  2.000000  3.000000
C  3.000000  5.166667  4.666667
D  4.666667  3.000000  5.666667

Installation can be done with either pip or conda:

pip install qnorm

or

conda config --add channels conda-forge
conda install qnorm
Maarten-vd-Sande
  • How is this different than specifying `method='average'` in his `df.rank()`? – Sos Aug 18 '20 at 10:41
    @Sos I can't clearly put dataframes in comments, but why don't you try it? I get different results with `method='average'`, and qnorm. It just resolves ties differently. – Maarten-vd-Sande Aug 18 '20 at 10:52
  • I tried to use your package and it threw a 'non precise type array' error. My input is a dataframe as well; any idea how to resolve it? – Xiaoxixi Sep 08 '20 at 13:16
  • @Xiaoxixi thanks for letting me know, I never had problems with this. What is type of each column? You can check with df.dtypes. If you have a github account, can you make an issue on the github page about it, with a small piece of code that reproduces the error? Then I should be able to fix it quickly: https://github.com/Maarten-vd-Sande/qnorm/issues/new – Maarten-vd-Sande Sep 08 '20 at 17:05
  • @Xiaoxixi I did some checking around and it happens when you use "non standard" datatypes (e.g. float16). It will still crash, but now output a message telling you to convert to e.g. float32 – Maarten-vd-Sande Sep 16 '20 at 13:04
  • @Maarten-vd-Sande, Hi sorry I missed your first message. But yes, I tried to convert it to floats and it worked! – Xiaoxixi Sep 18 '20 at 14:56
  • @Xiaoxixi Great! The newest release should do the conversion automatically for you – Maarten-vd-Sande Sep 19 '20 at 15:34

Note that scikit-learn offers a quantile transformation function in its preprocessing module:

import pandas as pd
import sklearn.preprocessing

df = pd.DataFrame({'C1': {'A': 5, 'B': 2, 'C': 3, 'D': 4},
                   'C2': {'A': 4, 'B': 1, 'C': 4, 'D': 2},
                   'C3': {'A': 3, 'B': 4, 'C': 6, 'D': 8}})
sklearn.preprocessing.quantile_transform(df)
array([[1.  , 1.  , 0.  ],
       [0.  , 0.  , 0.33],
       [0.33, 1.  , 0.67],
       [0.67, 0.33, 1.  ]])

It is also possible to adjust the data to a normal distribution instead of a uniform distribution:

sklearn.preprocessing.quantile_transform(df, output_distribution="normal")
array([[ 5.2 ,  5.2 , -5.2 ],
       [-5.2 , -5.2 , -0.43],
       [-0.43,  5.2 ,  0.43],
       [ 0.43, -0.43,  5.2 ]])
Gregor Sturm