Using .iterrows() with series.nlargest() to get the highest number in a row in a Dataframe

Question

I am trying to create a function that uses df.iterrows() and Series.nlargest. I want to iterate over each row and find the largest number and then mark it as a 1. This is the data frame:

A   B    C
9   6    5
3   7    2

Here is the output I wish to have:

A    B   C
1    0   0
0    1   0

This is the function I wish to use here:

def get_top_n(df, top_n):
    """


    Parameters
    ----------
    df : DataFrame

    top_n : int
        The top number to get
    Returns
    -------
    top_numbers : DataFrame
    Returns the top number marked with a 1

    """
    # Implement Function
    for row in df.iterrows():
        top_numbers = row.nlargest(top_n).sum()

    return top_numbers

I get the following error: AttributeError: 'tuple' object has no attribute 'nlargest'

Help would be appreciated on how to rewrite my function in a neater way so that it actually works! Thanks in advance

jezrael · Accepted Answer · 2018-08-02T06:52:06.187

9

Add an i variable, because iterrows returns the index together with a Series for each row:

for i, row in df.iterrows():
    top_numbers = row.nlargest(top_n).sum()
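
To see where the AttributeError comes from, the sketch below inspects what df.iterrows() yields, using the sample frame from the question. The variable names `first`, `is_tuple` and `row_maxima` are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({"A": [9, 3], "B": [6, 7], "C": [5, 2]})

# Each item from iterrows() is a 2-tuple: (index label, row as a Series)
first = next(df.iterrows())
is_tuple = isinstance(first, tuple)

# Unpacking the tuple gives access to the Series and its nlargest method
row_maxima = {}
for i, row in df.iterrows():
    row_maxima[i] = row.nlargest(1).index[0]

print(is_tuple)
print(row_maxima)
```

Calling nlargest on the tuple itself is what raises the error; calling it on the unpacked `row` works.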

General solution with numpy.argsort: rank the values in each row in descending order, compare the ranks with top_n, and convert the boolean array to integers:

import pandas as pd

def get_top_n(df, top_n):
    if not isinstance(top_n, int):
        raise ValueError("top_n is not an integer")
    if top_n > len(df.columns):
        raise ValueError("top_n is higher than the number of columns")

    # argsort gives the sort order; a second argsort turns it into ranks (0 = largest)
    ranks = (-df.values).argsort(axis=1).argsort(axis=1)
    arr = (ranks < top_n).astype(int)
    df1 = pd.DataFrame(arr, index=df.index, columns=df.columns)
    return df1

df1 = get_top_n(df, 2)
print (df1)
   A  B  C
0  1  1  0
1  1  1  0

df1 = get_top_n(df, 1)
print (df1)
   A  B  C
0  1  0  0
1  0  1  0
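
One detail worth checking with numpy.argsort is that it returns the positions that would sort each row, which is not the same thing as the rank of each element. The sketch below uses a hypothetical row `[5, 9, 6]` (not from the question) where the two differ, and shows that applying argsort a second time yields the ranks.

```python
import numpy as np

# Hypothetical row, chosen so that sort order and ranks differ
row = np.array([[5, 9, 6]])

order = (-row).argsort(axis=1)  # column positions in descending order of value
ranks = order.argsort(axis=1)   # rank of each column (0 = largest)

print(order)  # [[1 2 0]]
print(ranks)  # [[2 0 1]]

# Comparing the ranks (not the sort order) with top_n marks the right column
top1_mask = (ranks < 1).astype(int)
print(top1_mask)  # [[0 1 0]]
```

For the two rows in the question the sort order and the ranks happen to coincide, so the difference only shows up on other data.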

EDIT:

A solution with iterrows is possible, but not recommended because it is slow:

top_n = 2
for i, row in df.iterrows():
    top = row.nlargest(top_n).index
    df.loc[i] = 0
    df.loc[i, top] = 1

print (df)
   A  B  C
0  1  1  0
1  1  1  0
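
Note that the loop above overwrites df in place. If the original values are still needed, a sketch along the same lines can write into a separate zero-filled frame instead. The name `top_numbers` follows the question; the zero-filled starting frame is an assumption about what is wanted.

```python
import pandas as pd

df = pd.DataFrame({"A": [9, 3], "B": [6, 7], "C": [5, 2]})
top_n = 2

# Start from zeros with the same shape and labels, leaving df untouched
top_numbers = pd.DataFrame(0, index=df.index, columns=df.columns)
for i, row in df.iterrows():
    # Mark the columns holding the top_n values of this row
    top_numbers.loc[i, row.nlargest(top_n).index] = 1

print(top_numbers)
```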

edited Aug 02 '18 at 06:52

answered Aug 02 '18 at 05:15

jezrael


Ok. So do I still need to implement the iterrows()? The final output must be placed in the top_numbers variable. The function should return top_numbers @jezrael – Deepak M Aug 02 '18 at 05:31
When I tried the code above in the function, I changed df1 to top_numbers. But I get this error now: `AssertionError: Wrong value for get_top_n.` @jezrael – Deepak M Aug 02 '18 at 05:45
@DeepakM - OK, so what is the expected output? `iterrows` is best avoided, because it is slow. – jezrael Aug 02 '18 at 05:47
Basically, I want the function to accept any number in the top_n variable so that it is reusable, and to return top_numbers for that number. @jezrael – Deepak M Aug 02 '18 at 05:50
Can you tailor the solution into the function itself? @jezrael – Deepak M Aug 02 '18 at 05:51
So you need the solution rewritten as a function? Then please check the edited answer. – jezrael Aug 02 '18 at 05:54
I think the question is more complicated than it seems. What is the expected output for `np.random.seed(10) df = pd.DataFrame(np.random.randint(10, size=(5, 5))) print (df)` e.g. for top `2`? Or top `4`? – jezrael Aug 02 '18 at 07:17
Hey @jezrael, I get this error when I use your updated function: `AssertionError: Wrong value for get_top_n.` Why do you think that occurs? – Deepak M Aug 02 '18 at 08:49
@DeepakM - It seems to be a data problem, e.g. are some columns non-numeric? – jezrael Aug 02 '18 at 08:53
I want to implement the get_top_n function to get the top performing number for each month: mark the top performing numbers from df with a value of 1, and give all other values a 0. – Deepak M Aug 02 '18 at 08:56
Is it possible to get the expected output of `np.random.seed(10) df = pd.DataFrame(np.random.randint(10, size=(5, 5))) print (df)` for top 2? – jezrael Aug 02 '18 at 08:57
Not sure what the problem is. I need a [minimal, complete, and verifiable example](http://stackoverflow.com/help/mcve); also, if it works with your data sample, it is possible the real data are different. – jezrael Aug 02 '18 at 09:01
So maybe it would help to add a snippet of the real data, e.g. the first 3 rows with the expected output. – jezrael Aug 02 '18 at 09:02
Hey man, just to let you know the solution with `iterrows()` worked! Thanks again! – Deepak M Aug 03 '18 at 02:35
The solution does not work properly when your rows have some `NaN`. The `argsort` function sorts `NaN` along with the numbers. – Jocer May 20 '20 at 17:39
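
On the `NaN` point raised in the last comment, one possible alternative is `DataFrame.rank`, which by default leaves `NaN` cells unranked so they never compare as being in the top. This is a sketch under that assumption and not something proposed in the answers; the helper name `get_top_n_rank` and the sample frame are hypothetical.

```python
import numpy as np
import pandas as pd

def get_top_n_rank(df, top_n):
    # Hypothetical helper: rank each row in descending order (1 = largest).
    # With the default na_option="keep", NaN cells get a NaN rank.
    ranks = df.rank(axis=1, ascending=False, method="first")
    # NaN <= top_n evaluates to False, so missing values are marked 0
    return (ranks <= top_n).astype(int)

# Hypothetical data with a missing value in the second row
df_nan = pd.DataFrame({"A": [9.0, np.nan], "B": [6.0, 7.0], "C": [5.0, 2.0]})
top_numbers = get_top_n_rank(df_nan, 1)
print(top_numbers)
```

With `method="first"`, ties are broken by column order, so exactly top_n cells are marked per row as long as the row has at least top_n non-missing values.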

Josmoor98 · Answer 2 · 2018-12-21T13:25:18.047

For context, the dataframe consists of stock return data for the S&P500 over approximately 4 years

def get_top_n(prev_returns, top_n):

    # generate dataframe populated with zeros for merging
    top_stocks = pd.DataFrame(0, columns = prev_returns.columns, index = prev_returns.index)

    # find top_n largest entries by row
    df = prev_returns.apply(lambda x: x.nlargest(top_n), axis=1)

    # merge dataframes
    top_stocks = top_stocks.merge(df, how = 'right').set_index(df.index)

    # return dataframe replacing non_zero answers with a 1
    return (top_stocks.notnull()) * 1

score 1 · Answer 3 · answered Aug 21 '21 at 03:03

Alternatively, the 2-line solution could be

def get_top_n(df, top_n):

    # find top_n largest entries by stock
    df = df.apply(lambda x: x.nlargest(top_n), axis=1)

    # convert to booleans (True where a top_n value was kept, False for NaN), then to 1 and 0
    top_numbers = df.notnull().astype(int)

    return top_numbers
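
One thing to watch with the apply/nlargest approach is that a column which never makes the top top_n in any row may be missing from the intermediate result. A minimal sketch, assuming the output should always have the same columns as the input, reindexes against the original columns. The function name `get_top_n_all_columns` is hypothetical; the sample frame is the one from the question.

```python
import pandas as pd

def get_top_n_all_columns(df, top_n):
    # Keep only the top_n entries per row; everything else becomes NaN
    top = df.apply(lambda row: row.nlargest(top_n), axis=1)
    # Restore the original columns and their order (absent ones are filled with NaN)
    top = top.reindex(columns=df.columns)
    # True where a top value was kept, then convert to 1 and 0
    return top.notna().astype(int)

df = pd.DataFrame({"A": [9, 3], "B": [6, 7], "C": [5, 2]})
top_numbers = get_top_n_all_columns(df, 1)
print(top_numbers)
```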
