0

I am trying to filter a data frame based on two conditions.

  1. filter items where the field named 'Count' > 30 ...and...
  2. filter items where the field named 'LAND SQUARE FEET' < 5000

Here is the code that I tried, and I got errors, or I wouldn't be posting here.

df.loc[(df['Count']>=30) & (df['LAND SQUARE FEET']< 5000)['Count','LAND SQUARE FEET']]

df[df.eval("Count>=30 & (LAND SQUARE FEET <5000).values")]

How can I get this to work?

David Erickson
  • 16,433
  • 2
  • 19
  • 35
ASH
  • 20,759
  • 19
  • 87
  • 200

2 Answers2

1
df[(df["Count"] >= 30) & (df["LAND SQUARE FEET"] < 5000)]
kev1n
  • 140
  • 1
  • 7
  • the OP is also trying to filter by specific columns in addition to rows. Also, code only answers are not considered good answers. – David Erickson Oct 17 '20 at 01:21
  • 1
    @DavidErickson My bad. Kinda new to posting here. Thanks for the advice. – kev1n Oct 17 '20 at 01:23
  • @Kevin: I think your solution is very close, and this is interesting...apparently the 'Land Square Feet' is of object data type. I tried this: df["LAND SQUARE FEET"] = df["LAND SQUARE FEET"].astype(str).astype(int) I got this error: ValueError: invalid literal for int() with base 10: ' - ' – ASH Oct 17 '20 at 01:29
  • I would suggest doing .astype(str).astype(float) as suggested by this stackoverflow answer. https://stackoverflow.com/a/47764450/14102346 – kev1n Oct 17 '20 at 01:32
  • @ASH you have a string in your dataframe. I'll post a solution to show how to handle that. – David Erickson Oct 17 '20 at 01:40
1

Specifically, the error(see comments of other answer) is now saying there is a - in one of the values in your column, You can use my solution below OR you can do df["LAND SQUARE FEET"] = df["LAND SQUARE FEET"] =.replace('-','').astype(int) However, there may be other strings that you need to replace, that might mean you keep seeing errors if there are more strings, other than -, for example , or ft.. Also, that line with - might be bad data altogether as I'm not sure why a - would be in a number that is supposed to be an integer of square feet.

Also, you can look at that line specifically with df[df["LAND SQUARE FEET"].str.contains('-') and from there decide what you want to do with it -- either manipulate it's data with replace or make it NaN with pd.to_numeric()


Solution with pd.to_numeric():

You need to use pd.to_numeric() first as you have strings in your column and passing errors='coerce' changes to NaN for those values that are strings. The data type of the column should now be a float if you call df.info():

Step 1:

df = pd.DataFrame({"LAND SQUARE FEET" : [4500, '4,400 feet', '4,600', 4700, 5500, 6000],
                   "Count" : [45,55,65,75,15,25]})
df
Out[1]: 
  LAND SQUARE FEET  Count
0             4500     45
1       4,400 feet     55
2            4,600     65
3             4700     75
4             5500     15
5             6000     25

Step 2:

df = pd.DataFrame({"LAND SQUARE FEET" : [4500, '4,400 feet', '4,600', 4700, 5500, 6000],
                   "Count" : [45,55,65,75,15,25]})
df["LAND SQUARE FEET"] = pd.to_numeric(df["LAND SQUARE FEET"], errors='coerce')
df
Out[2]: 
   LAND SQUARE FEET  Count
0            4500.0     45
1               NaN     55
2               NaN     65
3            4700.0     75
4            5500.0     15
5            6000.0     25

Step 3 (and final output):

df = pd.DataFrame({"LAND SQUARE FEET" : [4500, '4,400 feet', '4,600', 4700, 5500, 6000],
                   "Count" : [45,55,65,75,15,25]})
df["LAND SQUARE FEET"] = pd.to_numeric(df["LAND SQUARE FEET"], errors='coerce')
new_df = df.loc[(df['Count']>=30) & (df['LAND SQUARE FEET']< 5000),['Count','LAND SQUARE FEET']]
new_df
Out[3]: 
   Count  LAND SQUARE FEET
0     45            4500.0
3     75            4700.0
David Erickson
  • 16,433
  • 2
  • 19
  • 35