
I have a dataframe with various columns. I want to check whether each row satisfies a condition or not. The condition comes from another CSV file, but here I provide a simplified example to illustrate my question:

The condition is that the price is at most 26000.

import numpy as np
import pandas as pd

cars = {'Brand': ['Honda Civic', 'Toyota Corolla', 'Ford Focus', 'Audi A4'],
        'Price': [22000, 25000, 27000, 35000]
        }

mydata = pd.DataFrame(cars, columns=['Brand', 'Price'],
                      index=['Car_1', 'Car_2', 'Car_3', 'Car_4'])

The data looks like this:

print(mydata)
            Brand  Price
Car_1     Honda Civic  22000
Car_2  Toyota Corolla  25000
Car_3      Ford Focus  27000
Car_4         Audi A4  35000

So, I created another column filled with np.nan, and in a for loop I check whether each row satisfies the condition; if yes, I give the value of True to that cell.

mydata['condition'] = np.nan


                Brand  Price  condition
Car_1     Honda Civic  22000        NaN
Car_2  Toyota Corolla  25000        NaN
Car_3      Ford Focus  27000        NaN
Car_4         Audi A4  35000        NaN

and my for loop is this:

for i in range(mydata.shape[0]):

    mydata.condition.iloc[i] = None

    if mydata.Price.iloc[i] <= 26000:
        mydata.condition.iloc[i] = True

Now, mydata looks like this:

            Brand  Price condition
Car_1     Honda Civic  22000      True
Car_2  Toyota Corolla  25000      True
Car_3      Ford Focus  27000      None
Car_4         Audi A4  35000      None

and if I use dropna() I will have the result I want:

filtered_results = mydata.dropna()


                Brand  Price condition
Car_1     Honda Civic  22000      True
Car_2  Toyota Corolla  25000      True

my problem is that I am getting a warning, as below:

SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  iloc._setitem_with_indexer(indexer, value)

My question is: what is the proper/efficient way of assigning a value to a dataframe in this line, so as to avoid the above warning: mydata.condition.iloc[i] = True

I appreciate your help.

Sean

2 Answers


Do not loop; you can do it in one pass:

mydata.loc[mydata.Price <= 26000, 'condition'] = True
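As a minimal self-contained sketch of this vectorized approach (the second condition hinted at in the comment is left as a placeholder, since the real conditions come from another CSV file):

```python
import pandas as pd

cars = {'Brand': ['Honda Civic', 'Toyota Corolla', 'Ford Focus', 'Audi A4'],
        'Price': [22000, 25000, 27000, 35000]}
mydata = pd.DataFrame(cars, index=['Car_1', 'Car_2', 'Car_3', 'Car_4'])

# One vectorized pass: build a boolean mask over all rows at once.
# A second condition could be combined here with & (and) or | (or).
mask = mydata['Price'] <= 26000

# .loc assigns on the original frame, so no SettingWithCopyWarning.
mydata.loc[mask, 'condition'] = True
filtered_results = mydata.dropna()
print(filtered_results)
```

Rows not matched by the mask keep NaN in `condition`, so the existing `dropna()` step still yields the desired result.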
Quang Hoang
  • I need to loop since there are two conditions, not only one. This example is just simplified. Also, my problem is with the assigning part that gives the error. – Sean Nov 26 '20 at 19:27
  • what is the problem with this: `mydata.condition.iloc[i] = True` ? @Quang – Sean Nov 26 '20 at 19:29
  • That's the point for not looping, you don't need to handle that error. You can and should do a search on SO for that error to see why. If you have 2 condition, do it again for the other condition. – Quang Hoang Nov 26 '20 at 19:32
  • @Sean as for why it doesn't work or give you a warning: see [this doc](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#why-does-assignment-fail-when-using-chained-indexing). – Quang Hoang Nov 26 '20 at 19:38
  • the problem is that the condition is not fixed. let me explain: Each row of `my_dataframe` has a code and a date. There is this other `refrence_Dataframe` that has the **entry date** and **exit date** for each code. I want to check each row of `my_dataframe` whether its date is in the **enter and exit interval** of the other `reference_dataframe` or not. – Sean Nov 26 '20 at 19:50
  • @Sean This isn't mentioned in your original post. Take it from me, looping is not the only way, nor the best way to do **I want to check each row of ...**. What you are asking is very likely to be an [XY-problem](https://meta.stackexchange.com/questions/66377/what-is-the-xy-problem). As I see it, my answer solves your question as posted. I also provide you with additional information as to why your code doesn't work. Sorry if you don't think that this is working for you. – Quang Hoang Nov 26 '20 at 19:56
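For what it's worth, the date-interval check described in the comments can also be done without a row loop, by merging the reference frame in and comparing whole columns. A minimal sketch, assuming hypothetical column names (`code`, `date`, `entry`, `exit`) and made-up data:

```python
import pandas as pd

# Hypothetical stand-ins for my_dataframe and reference_dataframe;
# the real column names and data come from the OP's CSV files.
my_dataframe = pd.DataFrame({
    'code': ['A', 'A', 'B'],
    'date': pd.to_datetime(['2020-01-05', '2020-03-01', '2020-02-10'])})
reference_dataframe = pd.DataFrame({
    'code': ['A', 'B'],
    'entry': pd.to_datetime(['2020-01-01', '2020-02-01']),
    'exit': pd.to_datetime(['2020-02-01', '2020-02-05'])})

# Bring each row's entry/exit interval alongside its date,
# then compare the whole columns at once (between is inclusive).
merged = my_dataframe.merge(reference_dataframe, on='code', how='left')
merged['condition'] = merged['date'].between(merged['entry'], merged['exit'])
print(merged)
```

This assumes each code appears once in the reference frame; with multiple intervals per code, the merge would produce one row per (row, interval) pair instead.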

You can apply a function to each row, with however many conditions you have; add more conditions to price_check to satisfy your requirement. It is not entirely clear from your question and comments what your exact problem is. Quang Hoang has a far more efficient solution than using apply, if his solution works for your problem.

def price_check(row):
    if row['Price'] <= 26000:
        return True
    else:
        return False

mydata['Price_check'] = mydata.apply(price_check, axis=1)

                Brand  Price  Price_check
Car_1     Honda Civic  22000         True
Car_2  Toyota Corolla  25000         True
Car_3      Ford Focus  27000        False
Car_4         Audi A4  35000        False
picmate 涅
  • *Why don't you apply a function*: because `apply` is generally bad? More info [in this question](https://stackoverflow.com/questions/54432583/when-should-i-not-want-to-use-pandas-apply-in-my-code). – Quang Hoang Nov 26 '20 at 19:33
  • OP needs to apply multiple conditions; so apply will iterate once and apply both conditions in one iteration rather than having to iterate multiple times which would justify the overhead apply function would take up; isn't it? – picmate 涅 Nov 26 '20 at 19:41
  • No, it isn't. Apply is generally looping, and looping is generally slower. You still need to do as many comparisons, but you lose vectorization. Besides, you have `np.select` for matching multiple conditions. – Quang Hoang Nov 26 '20 at 19:43
  • Agree with you. Better article comparing naive looping, apply, and vectorization, which could address OP's confusion as well: https://engineering.upside.com/a-beginners-guide-to-optimizing-pandas-code-for-speed-c09ef2c6a4d6 – picmate 涅 Nov 26 '20 at 19:52
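The `np.select` route mentioned in the comments can be sketched as follows; the cutoffs and labels here are made up purely for illustration:

```python
import numpy as np
import pandas as pd

cars = {'Brand': ['Honda Civic', 'Toyota Corolla', 'Ford Focus', 'Audi A4'],
        'Price': [22000, 25000, 27000, 35000]}
mydata = pd.DataFrame(cars, index=['Car_1', 'Car_2', 'Car_3', 'Car_4'])

# Each condition is a boolean Series over all rows; np.select picks
# the first matching choice per row, falling back to the default.
conditions = [mydata['Price'] <= 23000,
              mydata['Price'] <= 26000]
choices = ['cheap', 'mid']
mydata['bucket'] = np.select(conditions, choices, default='expensive')
print(mydata)
```

This keeps multiple conditions in one vectorized pass, without per-row looping or chained assignment.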