1

I have a dataset of U.S. Education Datasets: Unification Project. I want to find out

  1. Number of rows where enrolment in grade 9 to 12 (column: GRADES_9_12_G) is less than 5000
  2. Number of rows where enrolment is grade 9 to 12 (column: GRADES_9_12_G) is between 10,000 and 20,000.

I am having problem in updating the count whenever the value in the if statement is correct.

import pandas as pd 
import numpy as np

df = pd.read_csv("C:/Users/akash/Downloads/states_all.csv")
df.shape

df = df.iloc[:, -6] 

for key, value in df.iteritems():
    count = 0
    count1 = 0
    if value < 5000:
        count += 1
    elif value < 20000 and value > 10000:
        count1 += 1

print(str(count) + str(count1))

df looks like this

0        196386.0

1         30847.0

2        175210.0

3        123113.0

4       1372011.0

5        160299.0

6        126917.0

7         28338.0

8         18173.0

9        511557.0

10       315539.0

11        43882.0

12        66541.0

13       495562.0

14       278161.0

15       138907.0

16       120960.0

17       181786.0

18       196891.0

19        59289.0

20       189795.0

21       230299.0

22       419351.0

23       224426.0

24       129554.0

25       235437.0

26        44449.0

27        79975.0

28        57605.0

29        47999.0

          ...    

1462          NaN

1463          NaN

1464          NaN

1465          NaN

1466          NaN

1467          NaN

1468          NaN

1469          NaN

1470          NaN

1471          NaN

1472          NaN

1473          NaN

1474          NaN

1475          NaN

1476          NaN

1477          NaN

1478          NaN

1479          NaN

1480          NaN

1481          NaN

1482          NaN

1483          NaN

1484          NaN

1485          NaN

1486          NaN

1487          NaN

1488          NaN

1489          NaN

1490          NaN

1491          NaN

Name: GRADES_9_12_G, Length: 1492, dtype: float64

In the output I got

00
d_kennetz
  • 5,219
  • 5
  • 21
  • 44
superduper
  • 401
  • 1
  • 5
  • 16
  • Firstly, is `value` numeric? The `if value < 5000:` part of the code might not execute. Same thing with the later `elif value < 20000 and value > 10000:` part. – Sam Apr 11 '19 at 20:45
  • Depending on what the df looks like, this is quite a simple question. Can you show the df? – Parmandeep Chaddha Apr 11 '19 at 20:54
  • 1
    Hi AKASH, this is not the way you want to handle this problem. Pandas has built-ins for this, but it would be beneficial for us if you could include your data or a small subset of your data that could reproduce your problem. – d_kennetz Apr 11 '19 at 20:54
  • For starters you reset your counts to zero on every iteration of the loop. But you likely shouldn't be using a loop at all, and rather, learn to use pandas built ins – juanpa.arrivillaga Apr 11 '19 at 20:59
  • I added the df, thanks, everyone. – superduper Apr 11 '19 at 21:08

2 Answers2

1

With Pandas, using loops is almost always the wrong way to go. You probably want something like this instead:

print(len(df.loc[df['GRADES_9_12_G'] < 5000]))    
print(len(df.loc[(10000 < df['GRADES_9_12_G']) & (df['GRADES_9_12_G'] < 20000)]))
0x5453
  • 12,753
  • 1
  • 32
  • 61
0

I downloaded your data set, and there are multiple ways to go about this. First of all, you do not need to subset your data if you do not want to. Your problem can be solved like this:

import pandas as pd

df = pd.read_csv('states_all.csv')
df.fillna(0, inplace=True) # fill NA with 0, not required but nice looking
print(len(df.loc[df['GRADES_9_12_G'] < 5000])) # 184
print(len(df.loc[(df['GRADES_9_12_G'] > 10000) & (df['GRADES_9_12_G'] < 20000)])) # 52

The line df.loc[df['GRADES_9_12_G'] < 5000] is telling pandas to query the dataframe for all rows in column df['GRADES_9_12_G'] that are less than 5000. I am then calling python's builtin len function to return the length of the returned, which outputs 184. This is essentially a boolean masking process which returns all True values for your df that meet the conditions you give it.

The second query df.loc[(df['GRADES_9_12_G'] > 10000) & (df['GRADES_9_12_G'] < 20000)] uses an & operator which is a bitwise operator that requires both conditions to be met for a row to be returned. We then call the len function on that as well to get an integer value of the number of rows which outputs 52.

To go off your method:

import pandas as pd
df = pd.read_csv('states_all.csv')
df.fillna(0, inplace=True) # fill NA with 0, not required but nice looking
df = df.iloc[:, -6] # select all rows for your column -6
print(len(df[df < 5000])) # query your "df" for all values less than 5k and print len
print(len(df[(df > 10000) & (df < 20000)])) # same as above, just for vals in between range

Why did I change the code in my answer instead of using yours?

Simply enough to say, it is more pandonic. Where we can, it is cleaner to use pandas built-ins than iterating over dataframes with for loops, as this is what pandas was designed for.

d_kennetz
  • 5,219
  • 5
  • 21
  • 44
  • No problem, if this solution was sufficient please consider accepting it as an answer by clicking the check mark under the arrows. – d_kennetz Apr 11 '19 at 21:26