2

Say I have a pandas dataframe that looks like this:

   color  number
0    red       3
1   blue       4
2  green       2
3   blue       2

I want to get the first value from the number column where the color column has the value 'blue' which in this case would return 4.

I know this can be done using loc in something like this:

df[df['color'] == 'blue']['number'][0]

I'm wondering if there is any more optimal approach given that I only ever need the first occurrence.

jpp
  • 159,742
  • 34
  • 281
  • 339
UBears
  • 371
  • 1
  • 4
  • 18

3 Answers3

3

Use head—this will return the first row if the color exists, and an empty Series otherwise.

col = 'blue'
df.query('color == @col').head(1).loc[:, 'number']

1    4
Name: number, dtype: int64

Alternatively, to get a single item, use obj.is_empty:

u = df.query('color == @col').head(1)
if not u.is_empty:
    print(u.at[u.index[0], 'number'])

# 4

Difference between head and idxmax for invalid color:

df.query('color == "blabla"').head(1).loc[:, 'number']
# Series([], Name: number, dtype: int64)

df.loc[(df['color'] == 'blabla').idxmax(),'number']
# 3
cs95
  • 379,657
  • 97
  • 704
  • 746
2

Using idxmax

df.loc[(df['color'] == 'blue').idxmax(),'number']
Out[698]: 4
BENY
  • 317,841
  • 20
  • 164
  • 234
  • 2
    You need an if statement here before calling idxmax. – cs95 Jan 18 '19 at 17:24
  • @coldspeed Like this? `df.loc[if (df['color'] == 'blue').idxmax(),'number']` – UBears Jan 18 '19 at 18:20
  • @UBears No, for example see my answer below. Alternatively, `if (df['color'] == 'blue').any(): ...` – cs95 Jan 18 '19 at 18:21
  • @coldspeed So I need an if statement only to check if the column is empty? – UBears Jan 18 '19 at 18:24
  • 1
    @UBears no, it's to check whether "blue" exists. Otherwise, if blue does not exist, `idxmax` always returns 0, when the result returned should be empty. Does that make sense? Once again, you can see my answer for example and comparison of the answers – cs95 Jan 18 '19 at 18:27
1

Using iloc with np.where:

idx = next(iter(df['number'].iloc[np.where(df['color'].eq('blue'))]), -1)  # 4

Note this also handles the case where the colour does not exist. In comparison, df['color'].eq('orange').idxmax() gives 0 even though 'orange' does not exist in the series. The above logic will give -1.

numba

I'm wondering if there is any more optimal approach given that I only ever need the first occurrence.

Yes! For a more efficient solution, see Efficiently return the index of the first value satisfying condition in array. Numba allows you to iterate row-wise efficiently. In this case, you will need to factorize your strings first so that you feed numeric arrays only to Numba:

from numba import njit

# factorize series, pd.factorize maintains order,
# i.e. first item in values gives 0 index
idx, values = pd.factorize(df['color'])
idx_search = np.where(values == 'blue')[0][0]

@njit
def get_first_index_nb(A, k):
    for i in range(len(A)):
        if A[i] == k:
            return i
    return -1

res = df['number'].iat[get_first_index_nb(idx, 1)]  # 4

Of course, for a one-off calculation, this is inefficient. But for successive calculations, the solution will likely be a factor faster than solutions which check for equality across the entire series / array.

jpp
  • 159,742
  • 34
  • 281
  • 339