0

I have a large dataframe (sample). I was filtering the data according to this code:

A = [f"A{i}" for i in range(50)]  
B = [f"B{i}" for i in range(50)]  
C = [f"C{i}" for i in range(50)]

for i in A:
 cond_A = (df[i]>= -0.0423) & (df[i]<=3)  
 filt_df = df[cond_A]
for i in B:
 cond_B = (filt_df[i]>= 15) & (filt_df[i]<=20)  
 filt_df2 = filt_df[cond_B]
for i in C:
 cond_C = (filt_df2[i]>= 15) & (filt_df2[i]<=20)
 filt_df3 = filt_df2[cond_B]
  1. When I print filt_df3, I am getting only an empty dataframe - why?
  2. How can I improve the code, other approaches like some advanced techniques?
  3. I am not sure the code above works as outlined in the edit below?
  4. I would like to know how can I change the code, such that it works as outlined in the edit below?

Edit:

  1. I want to remove the rows based on columns (A0 - A49) based on cond_A.
  2. Then filter the dataframe from 1 based on columns (B0 - B49) with cond_B.
  3. Then filter the dataframe from 2 based on columns (C0 - C49) with cond_C.

Thank you very much in advance.

Timus
  • 10,974
  • 5
  • 14
  • 28
mathew
  • 59
  • 6
  • please add a sample dataframe that makes the code work and easy to replicate. with the current code `df` is not defined. https://stackoverflow.com/help/minimal-reproducible-example – D.L Mar 23 '22 at 11:24
  • **1.** `df[i]>= -0,0423` should be `df[i]>= -0.0423` (decimal place with a dot, not a comma). **2.** Create a smaller version of your dataframe, like with 3 A's, 3 B's and 3 C's; instead of 50 of each. **3.** `filt_df = df[cond_A]` is inside the loop so will only ever hold the last filter result, so A49. Similarly for B49 and C49 in the next two loops. Your result is probably empty because column C49 doesn't have any rows that match from the previous B49, and for that, from the previous A49. **4. Explain in words** what you're trying to do because the logic of the code is unclear. – aneroid Mar 23 '22 at 11:26
  • @aneroid I have added a section Edit.I think now it's clear for you. – mathew Mar 24 '22 at 09:02
  • Okay, so what that actually means is: you want to filter all columns at the same time using one set of conditions for all the A* cols, then another for B* cols and another for C*. Next step: Create a smaller version of your dataframe, like with 3 A's, 3 B's and 3 C's; instead of 50 of each. Use `df.head(10).to_clipboard(excel=False)` to copy the first 10 rows of your dataframe to clipboard (as text) which you can then paste into your question. Read this post on [how to provide a **good pandas example**](http://stackoverflow.com/q/20109391). – aneroid Mar 24 '22 at 09:59
  • As has been pointed out: As long as the conditions should be used in _conjunction_ the result on the sample you've provided is an empty dataframe because at least one column doesn't satisfy it (e.g. A1). So do you actually want that - or do you want a _disjunction_? – Timus Mar 24 '22 at 20:57

3 Answers3

1

It seems to me that there is an issue with your codes when you are using the iteration to do the filtering. For example, filt_df is being overwritten in every iteration of the first loop. When the loop ends, filt_df only contains the data filtered with the conditions set in the last iteration. Is this what you intend to do?

And if you want to do the filtering efficient, you can try to use pandas.DataFrame.query (see documentation here). For example, if you want to filter out all rows with column B0 to B49 containing values between 0 and 200 inclusive, you can try to use the Python codes below (assuming that you have imported the raw data in the variable df below).

condition_list = [f'B{i} >= 0 & B{i} <= 200' for i in range(50)]
filter_str = ' & '.join(condition_list)
subset_df = df.query(filter_str)
print(subset_df)
Timus
  • 10,974
  • 5
  • 14
  • 28
starklikos
  • 154
  • 1
  • 1
  • 8
  • the asnwer to the question is No.I want to look into A0 to A49 whether it has values between -0.0423 and 3.All other rows must be filtered out. – mathew Mar 23 '22 at 15:50
  • I am getting an error in this line: subset_df = df.query(filter_str) ,ValueError: too many inputs – mathew Mar 24 '22 at 07:59
0

Since the column A1 contains only -0.057 which is outside [-0.0423, 3] everything gets filtered out.

Nevertheless, you seem not to take over the filter in every loop as filt_df{1|2|3} is reset.

This should work:

import pandas as pd

A = [f"A{i}" for i in range(50)]  
B = [f"B{i}" for i in range(50)]  
C = [f"C{i}" for i in range(50)]

filt_df = df.copy()
for i in A:
    cond_A = (df[i] >= -0.0423) & (df[i]<=3)  
    filt_df = filt_df[cond_A]
filt_df2 = filt_df.copy()
for i in B:
    cond_B = (filt_df[i]>= 15) & (filt_df[i]<=20)  
    filt_df2 = filt_df2[cond_B]
filt_df3 = filt_df2.copy()
for i in C:
    cond_C = (filt_df2[i]>= 15) & (filt_df2[i]<=20)
    filt_df3 = filt_df3[cond_B]

print(filt_df3)

Of course you will find a lot of filter tools in the pandas library that can be applied to multiple columns

For example this: https://stackoverflow.com/a/39820329/6139079

Engensmax
  • 109
  • 1
  • 9
  • I am getting a warning **UserWarning: Boolean Series key will be reindexed to match DataFrame index. filt_df = filt_df[cond_A]** when i print the filt_df2,i am getting an empty dataframe @engensmax – mathew Mar 24 '22 at 08:30
0

You can filter by all columns together with DataFrame.all for test if all rows match together:

A = [f"A{i}" for i in range(50)] 
cond_A = ((df[A] >= -0.0423) & (df[A]<=3)).all(axis=1)
B = [f"B{i}" for i in range(50)] 
cond_B = ((df[B]>= 15) & (df[B]<=20)).all(axis=1)
C = [f"C{i}" for i in range(50)] 
cond_C = ((df[C]>= 15) & (df[C]<=20)).all(axis=1)

And last chain all masks by & for bitwise AND:

filt_df = df[cond_A & cond_B & cond_C]

If get empty DataFrame it seems no row satisfy all conditions.

jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252