
I'm trying to write two for loops that will return a score for different inputs, and create a new field with the new score. The first loop works fine but the second loop never returns the correct score.

import pandas as pd

d = {'a':['foo','bar'], 'b':[1,3]}

df = pd.DataFrame(d)

score1 = df.loc[df['a'] == 'foo']
score2 = df.loc[df['a'] == 'bar']

for i in score1['b']:
    if i < 3:
        score1['c'] = 0
    elif i <= 3 and i < 4:
        score1['c'] = 1
    elif i >= 4 and i < 5:
        score1['c'] = 2
    elif i >= 5 and i < 8:
        score1['c'] = 3
    elif i == 8:
        score1['c'] = 4

for j in score2['b']:
    if j < 2:
        score2['c'] = 0
    elif j <= 2 and i < 4:
        score2['c'] = 1
    elif j >= 4 and i < 6:
        score2['c'] = 2
    elif j >= 6 and i < 8:
        score2['c'] = 3
    elif j == 8:
        score2['c'] = 4
        
print(score1)
print(score2)

When I run the script, it returns the following:

print(score1)
     a  b  c
0  foo  1  0

print(score2)
     a  b
1  bar  3

Why doesn't score2 create the new field "c" or a score?

SupaDupa
  • Typo: The second loop needs to use `j < 4` instead of `i < 4`. – Barmar Jan 13 '23 at 00:34
  • And `j <= 2` should be `j >= 2`. But you don't really need the `>=` conditions, because the previous condition already precludes those. – Barmar Jan 13 '23 at 00:36
  • Because none of the conditions in your second for loop is satisfied. On the first iteration `j = 3`, so the field `c` is never added. – Hari E Jan 13 '23 at 00:38
  • the main problem is, as has already been stated, the ">" vs "<" and i/j typos (typical copy/paste errors), but it probably also should be noted (for posterity) that running this presents the following warning (tldr: the code itself can produce unpredictable results): _A value is trying to be set on a copy of a slice from a DataFrame. Try using_ `.loc[row_indexer,col_indexer] = value` _instead. See the caveats in the documentation:_ https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy – michael Jan 13 '23 at 04:46
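
The warning quoted in the last comment can usually be avoided in one of two ways. The following is a minimal sketch, assuming the sample frame from the question; the constant score of 1 is only a placeholder for illustration.

```python
import pandas as pd

df = pd.DataFrame({'a': ['foo', 'bar'], 'b': [1, 3]})

# Option 1: take an explicit copy, so the subset is an independent DataFrame
# and adding a column to it is unambiguous.
score2 = df.loc[df['a'] == 'bar'].copy()
score2['c'] = 1  # placeholder score, for illustration only

# Option 2: write into the original frame with .loc[row_indexer, col_indexer],
# as the warning message suggests. Rows that do not match get NaN.
df.loc[df['a'] == 'bar', 'c'] = 1

print(score2)
print(df)
```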

2 Answers


Avoid using for loops to conditionally update DataFrame columns, which are not Python lists. Use vectorized methods of pandas and NumPy such as `numpy.select`, which scales to millions of rows. Remember that these data science tools compute very differently from general-purpose Python:

import numpy

# LIST OF BOOLEAN CONDITIONS
conds = [
    score1['b'].lt(3),                            # EQUIVALENT TO < 3
    score1['b'].between(3, 4, inclusive="left"),  # EQUIVALENT TO >= 3 and < 4
    score1['b'].between(4, 5, inclusive="left"),  # EQUIVALENT TO >= 4 and < 5
    score1['b'].between(5, 8, inclusive="left"),  # EQUIVALENT TO >= 5 and < 8
    score1['b'].eq(8)                             # EQUIVALENT TO == 8
]

# LIST OF VALUES
vals = [0, 1, 2, 3, 4]

# VECTORIZED ASSIGNMENT
score1['c'] = numpy.select(conds, vals, default=numpy.nan)

# LIST OF BOOLEAN CONDITIONS (score2)
conds = [
    score2['b'].lt(2),
    score2['b'].between(2, 4, inclusive="left"),
    score2['b'].between(4, 6, inclusive="left"),
    score2['b'].between(6, 8, inclusive="left"),
    score2['b'].eq(8)
]   

# LIST OF VALUES
vals = [0, 1, 2, 3, 4]

# VECTORIZED ASSIGNMENT
score2['c'] = numpy.select(conds, vals, default=numpy.nan)
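
Since the two blocks above differ only in their thresholds, the conditions could be built by one small helper. This is a minimal sketch rather than part of the original answer: the name `score_column` and its `edges` parameter are hypothetical, and plain comparison operators stand in for `between`.

```python
import numpy as np
import pandas as pd

def score_column(b, edges):
    """Hypothetical helper: score a Series `b` against four ascending edges.

    Values below edges[0] score 0, each half-open interval [edges[k], edges[k+1])
    scores k + 1, a value equal to edges[3] scores 4, anything else is NaN.
    """
    e0, e1, e2, e3 = edges
    conds = [
        b < e0,
        (b >= e0) & (b < e1),
        (b >= e1) & (b < e2),
        (b >= e2) & (b < e3),
        b == e3,
    ]
    vals = [0, 1, 2, 3, 4]
    return np.select(conds, vals, default=np.nan)

df = pd.DataFrame({'a': ['foo', 'bar'], 'b': [1, 3]})

# .copy() keeps the subsets independent of df
score1 = df.loc[df['a'] == 'foo'].copy()
score2 = df.loc[df['a'] == 'bar'].copy()

score1['c'] = score_column(score1['b'], [3, 4, 5, 8])  # thresholds of the first loop
score2['c'] = score_column(score2['b'], [2, 4, 6, 8])  # thresholds of the second loop

print(score1)
print(score2)
```

Because the default is NaN, the resulting column is floating point; a value outside every range (for example 9) is left as NaN instead of being silently skipped.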
Parfait
  • Could you point me to anywhere where I can read more about "Avoid the use of for loops to conditionally update DataFrame columns which are not Python lists"? I assumed for loops were the way to go for this type of problem. I'm weak in numpy but I'll give this a shot (thank you). – SupaDupa Jan 13 '23 at 01:10
  • Those are my words. Again, you are thinking in basic, standard library Python with loops and not data science Python with vectorized methods. Consider reading online docs and tutorials such as [Intro to pandas](https://pandas.pydata.org/docs/getting_started/index.html#intro-to-pandas). In fact, the section *How to create new columns derived from existing columns?* mentions you do not need to loop. Also, many numpy array methods like `select` can be used on pandas Series or DataFrames. Keep reading, learning, and trying! Happy coding! – Parfait Jan 13 '23 at 04:15
  • here's a perhaps helpful q & a discussion, re: vectorization vs looping https://stackoverflow.com/questions/54028199/are-for-loops-in-pandas-really-bad-when-should-i-care (tldr: vectorization is essential at scale, looping is ok for small data) ... _however_, I'll just add that imho knowing how an API is designed to work with vectorization is essential to simply _using_ any of these APIs (pandas, numpy, all ML libraries), and once you learn these techniques, looping is more typing (and as we see here, error-prone), anyway. – michael Jan 13 '23 at 04:35

On the first iteration of the second for loop, j is 3, so none of your conditions is satisfied and `c` is never assigned.

for j in score2['b']:
    if j < 3:
        score2['c'] = 0
    elif j >= 3 and j < 5:
        score2['c'] = 1
    elif j >= 5 and j < 7:
        score2['c'] = 2
    elif j >= 7 and j < 9:
        score2['c'] = 3
    elif j == 9:
        score2['c'] = 4
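
For small data, the branching can also stay in plain Python as a function that returns one score per value. The sketch below assumes the thresholds from the question's second loop (2, 4, 6, 8); the name `score_bar` is hypothetical. As Barmar's comment notes, the `>=` checks are redundant because each earlier branch has already excluded the smaller values.

```python
import pandas as pd

def score_bar(j):
    """Hypothetical helper: score one value with the question's second-loop thresholds."""
    if j < 2:
        return 0
    elif j < 4:   # reached only when j >= 2
        return 1
    elif j < 6:   # reached only when j >= 4
        return 2
    elif j < 8:   # reached only when j >= 6
        return 3
    elif j == 8:
        return 4
    return None   # no branch matched (j > 8)

df = pd.DataFrame({'a': ['foo', 'bar'], 'b': [1, 3]})
score2 = df.loc[df['a'] == 'bar'].copy()

# map applies the function to each value and assigns row by row,
# instead of overwriting the whole column on every loop iteration
score2['c'] = score2['b'].map(score_bar)
print(score2)
```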
Hari E