0

Basically I have census data of US that I have read in Pandas from a csv file. Now I have to write a function that finds counties in a specific manner (not gonna explain that because that's not what the question is about) from the table I have gotten from csv file and return those counties.

MY TRY:

What I did is that I created lists with the names of the columns (that the function has to return), then applied the specific condition in the for loop using if-statement to read the entries of all required columns in their respective list. Now I created a new DataFrame and I want to read the entries from lists into this new DataFrame. I tried the same for loop to accomplish it, but all in vain, tried to make Series out of those lists and tried passing them as a parameter in the DataFrame, still all in vain, made DataFrames out of those lists and tried using append() function to concatenate them, but still all in vain. Any help would be appreciated.

CODE:

#idxl = list()
#st = list()
#cty = list()
idx2 = 0
cty_reg = pd.DataFrame(columns = ('STNAME', 'CTYNAME'))
for idx in range(census_df['CTYNAME'].count()):
    if((census_df.iloc[idx]['REGION'] == 1 or census_df.iloc[idx]['REGION'] == 2) and (census_df.iloc[idx]['POPESTIMATE2015'] > census_df.iloc[idx]['POPESTIMATE2014']) and census_df.loc[idx]['CTYNAME'].startswith('Washington')):
    #idxl.append(census_df.index[idx])
    #st.append(census_df.iloc[idx]['STNAME'])
    #cty.append(census_df.iloc[idx]['CTYNAME'])
    cty_reg.index[idx2] = census_df.index[idx]
    cty_reg.iloc[idxl2]['STNAME'] = census_df.iloc[idx]['STNAME']
    cty_reg.iloc[idxl2]['CTYNAME'] = census_df.iloc[idx]['CTYNAME']
    idx2 = idx2 + 1
cty_reg

CENSUS TABLE PIC:

enter image description here

SAMPLE TABLE:

   REGION  STNAME        CTYNAME
0       2  "Wisconsin"   "Washington County"
1       2  "Alabama"     "Washington County"
2       1  "Texas"       "Atauga County"
3       0  "California"  "Washington County"

SAMPLE OUTPUT:

  STNAME      CTYNAME
0 Wisconsin  Washington County
1 Alabama    Washington County

I am sorry for the less-knowledge about the US-states and counties, I just randomly put the state names and counties in the sample table, just to show you what do I want to get out of that. Thanks for the help in advanced.

ASGM
  • 11,051
  • 1
  • 32
  • 53
Khubaib Khawar
  • 161
  • 1
  • 1
  • 9
  • 4
    Can you give us a small sample dataframe for both your input and (desired) output? That will make it easier for people to answer your question. – ASGM Oct 23 '18 at 00:42
  • Done, added the sample table and sample output – Khubaib Khawar Oct 23 '18 at 00:57
  • 2
    Okay, great. I've improved the formatting so people can paste it into their code. Now, can you be clear about the logic you're using for picking these rows? Don't you need the population columns in your input? – ASGM Oct 23 '18 at 01:16
  • That's right, I only need `STNAME` and `CTYNAME` columns in my output and the logic you explained in your answer is correct, that's what I was applying to get the desired `DataFrame`. Thanks for your answer as it has helped me a lot. I basically knew that Pandas makes these conditionals very easy, but at the time of writing code my mind was not focused on the easiness of Pandas. Thanks again. Can you read my comment on your answer and reply me there? – Khubaib Khawar Oct 23 '18 at 09:21
  • I need the population columns in my input, but in sample I just did not write because it would be a long table then. By the way I have already solved the question, thanks. – Khubaib Khawar Oct 23 '18 at 10:45
  • I just need to know that if I can use `and` and `or` instead of symbols `&` and `|`. If yes, then how would it be possible? just replace symbols with words or there is special formatting for that? – Khubaib Khawar Oct 23 '18 at 10:46

3 Answers3

1

If I'm reading the logic in your code right, you want to select rows according to the following conditions:

  • REGION should be 1 or 2
  • POPESTIMATE2015 > POPESTIMATE2014
  • CTYNAME needs to start with "Washington"

In general, Pandas makes it easy to select rows based on conditions without having to iterate over the dataframe:

df = census_df[
        ((df.REGION == 1) | (df.REGION == 2)) & \
        (df.POPESTIMATE2015 > POPESTIMATE2014) & \
        (df.CTYNAME.str.startswith('Washington'))
    ]
ASGM
  • 11,051
  • 1
  • 32
  • 53
  • Thanks for the answer, it helped me to understand that Pandas makes it easier. By the way can you tell me why are you using the 'fore-slash' ``\\` after the 'AND' `&` symbol? Also can I use the words `and` and `or` instead of `&` and `|` symbols? – Khubaib Khawar Oct 23 '18 at 03:37
1

There are some missing columns in the source DF posted in the OP. However, reading the loop I don't think the loop is required at all. There are 3 filters required - for REGION, POPESTIMATE2015 and CTYNAME. If I have understood the logic in the OP, then this should be feasible without the loop

Option 1 - original answer

print df.loc[
            (df.REGION.isin([1,2])) & \
            (df.POPESTIMATE2015 > df.POPESTIMATE2014) & \
            (df.CTYNAME.str.startswith('Washington')), \
                          ['REGION', 'STNAME', 'CTYNAME']]

Option 2 - using and with pd.eval

q = pd.eval("(df.REGION.isin([1,2])) and \
            (df.POPESTIMATE2015 > df.POPESTIMATE2014) and \
            (df.CTYNAME.str.startswith('Washington'))", \
            engine='python')
print df.loc[q, ['REGION', 'STNAME', 'CTYNAME']]

Option 3 - using and with df.query

regions_list = [1,2]
dfq = df.query("(REGION==@regions_list) and \
              (POPESTIMATE2015 > POPESTIMATE2014) and \
              (CTYNAME.str.startswith('Washington'))", \
              engine='python')
print dfq[['REGION', 'STNAME', 'CTYNAME']]
edesz
  • 11,756
  • 22
  • 75
  • 123
  • thanks a lot this really helped me to chose the only required Columns from the final Table. Can you tell me why are you using the 'fore-slash' ``\\` after the 'AND' `&` symbol? Also can I use the words `and` and `or` instead of `&` and `|` symbols? – Khubaib Khawar Oct 23 '18 at 03:40
  • 1
    The \ was to [extend the command across multiple lines](https://stackoverflow.com/a/4172465/4057186). You should read about Boolean indexing - [1](https://jakevdp.github.io/PythonDataScienceHandbook/02.06-boolean-arrays-and-masks.html#Boolean-operators), [2](https://pandas.pydata.org/pandas-docs/stable/indexing.html#boolean-indexing), [3](https://stackoverflow.com/a/21415990/4057186). – edesz Oct 23 '18 at 05:34
  • @KhubaibKhawar, I missed your question about using `and` and `or`. The 3 links I posted for Boolean indexing in Pandas are certainly valid, but if you strictly want to use `and` and `or` then you have to use [`.eval`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.eval.html#pandas-eval) or [`.query`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.query.html#pandas-dataframe-query). I have edited my answer to show these possible uses in the OP. Related [1](https://stackoverflow.com/a/50980994/4057186) and [2](https://stackoverflow.com/a/47723335/4057186). – edesz Oct 30 '18 at 18:39
1

Assuming you're selecting some kind of rows that satisfy a criteria, let's just say that select(row) and this function returns True if selected or False if not. I'll not infer what it is because you specifically said it was not important

And then you wanted the STNAME and CTYNAME of that row.

So here's what you would do:

your_new_df = census_df[census_df.apply(select, axis=1)]\
.apply(lambda x: x[['STNAME', 'CTYNAME']], axis=1)

This is the one liner that will get you what you wanted provided you wrote the select function that will pick the rows.

Rocky Li
  • 5,641
  • 2
  • 17
  • 33