I'm working with the UCI congressional vote dataset, where 1.0 is yay, 0.0 is nay and NaN is abstain. The second set of columns is what I'm trying to add to the dataframe, but those values are incorrect. I am trying to binarize this dataframe so that I have something like:

100 for yay
010 for nay
001 for abstain

so I can run association rules. I was able to create 16 extra columns for abstain (there are 16 votes, v1 to v16). However, when I try to create the 16 nay columns by checking the value in the original vote column, it does not work: for nay_v1 the result should be 1,1,0,1,0 but it is 1,1,1,1,1. The abstain columns were created using isna(), but for nay I want to check whether the vote column value is 0.0 and, if so, put 1.0 in the nay column for that vote.

I tried two ways, using loc and iloc, based on answers on this site, but neither works. I think both produced the output I described above.

First method:

for (idx, row) in cvotes.iterrows():
    for c in cols:
        if row.loc[c]==0.0:
            cvotes[f'nay_{c}'] = 1.0
        elif row.loc[c] == 1.0:
            cvotes[f'nay_{c}'] = 0.0
        elif row.loc[c] == np.nan:
            cvotes[f'nay_{c}'] = 0.0

Second method:

for c in cols:
    for i in range(len(cvotes.iloc[:][c])):
        val = cvotes.iloc[i][c]
        if val == 0.0:
            cvotes[f'nay_{c}'] = 1.0
        else:
            cvotes[f'nay_{c}'] = 0.0

What am I doing wrong here? It's fairly frustrating because I thought I was okay with numpy array indexing and even Python list indexing.

Edit:

Sample dataframe:

cvotes = pd.read_csv('house-votes-84.data', sep=',', header=None)
cvotes.head()
cvotes.columns = ['party', 'v1','v2','v3', 'v4','v5','v6','v7',
                  'v8', 'v9', 'v10', 'v11', 'v12', 'v13','v14','v15',
                  'v16']

cvotes.head()

Download csv from: http://archive.ics.uci.edu/ml/datasets/Congressional+Voting+Records

This is the result I want:

v1  nay_v1
0.0 1.0
0.0 1.0
NaN 0.0
0.0 1.0
1.0 0.0

I updated my code, but now I just get 0's:

# make cols for is nay 
for c in cols:
    #make column preset to val
    cvotes[f'nay_{c}']= 0.0
    #iterate and change vals on vote col condition
    for i in range(len(cvotes.iloc[:][c])):
        val = cvotes.iloc[i][c]
        #print(val)
        if val == 0.0:
            cvotes.iloc[i][f'nay_{c}'] = 1.0
        else:
            cvotes.iloc[i][f'nay_{c}'] = 0.0
mLstudent33
  • please copy and paste sample dataframes and your expected output, we can't work with images – ansev Mar 13 '20 at 21:13
  • @ansev done, thanks for patience. – mLstudent33 Mar 13 '20 at 21:19
  • My expected output is stated in the given example for `nay_v1` – mLstudent33 Mar 13 '20 at 21:20
  • @mLstudent33 it is still not clear to me what your expected output should be. Could you type something up in Excel with a couple of rows and copy and paste it as code? – David Erickson Mar 13 '20 at 21:36
  • actually I think it's just simple indexing error I'm fixing now in my ipynb and going to test. Like this instead: `cvotes[i][f'nay_{c}'] = 1.0` – mLstudent33 Mar 13 '20 at 21:39
  • Does this answer your question? [Pandas/Python: Set value of one column based on value in another column](https://stackoverflow.com/questions/49161120/pandas-python-set-value-of-one-column-based-on-value-in-another-column) – AMC Mar 14 '20 at 01:10
  • I saw that but I had some problems fitting my use case into that example – mLstudent33 Mar 14 '20 at 01:38

2 Answers

I saw this: Pandas/Python: Set value of one column based on value in another column

And did:

for c in cols:
    cvotes[f'nay_{c}'] = cvotes[c]
    cvotes.loc[cvotes[c] == 0.0, f'nay_{c}']=1.0
    cvotes.loc[cvotes[c] == 1.0, f'nay_{c}']=0.0
    cvotes.loc[cvotes[c].isna(), f'nay_{c}']=0.0

pd.set_option('display.max_columns', None)
cvotes.head()

This gets the correct output: [image: cvotes.head() with the new nay columns]
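For anyone wondering why the loops in the question failed, my understanding is: `cvotes[f'nay_{c}'] = 1.0` assigns the scalar to every row of the column, so only the last row visited decides the result; `cvotes.iloc[i][...] = 1.0` is chained indexing and usually writes to a temporary copy; and `x == np.nan` is always False. Below is a minimal loop-free sketch of the same idea. The small dataframe is made-up stand-in data, and the `yea_` and `abstain_` column names are my own assumption, not part of the dataset.

```python
import numpy as np
import pandas as pd

# Stand-in for a few vote columns (illustrative data, not the real file)
cvotes = pd.DataFrame({
    "v1": [0.0, 0.0, np.nan, 0.0, 1.0],
    "v2": [1.0, np.nan, 0.0, 1.0, 0.0],
})
cols = ["v1", "v2"]

for c in cols:
    # A comparison against NaN is False, so abstentions become 0.0 here
    cvotes[f"nay_{c}"] = (cvotes[c] == 0.0).astype(float)
    cvotes[f"yea_{c}"] = (cvotes[c] == 1.0).astype(float)
    # NaN has to be detected with isna(), never with ==
    cvotes[f"abstain_{c}"] = cvotes[c].isna().astype(float)

print(cvotes[["v1", "nay_v1", "yea_v1", "abstain_v1"]])
```

Each comparison produces a whole boolean column at once, so no row-by-row assignment is needed.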

  • Also, if anyone comes here to look: I just saw that pd.get_dummies() might have saved me a ton of time. https://towardsdatascience.com/the-dummys-guide-to-creating-dummy-variables-f21faddb1d40 – mLstudent33 Mar 13 '20 at 22:24
0
# try dummy variables for each column of votes
v1 = pd.get_dummies(cvotes['v1'])
v1.head()

Output: [image: result of v1.head()]
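By default `pd.get_dummies()` ignores NaN, so the abstain indicator would be missing. A hedged sketch of how the three-way encoding from the question might be obtained with `dummy_na=True` follows; the sample values are made up, and the `nay_v1` / `yea_v1` / `abstain_v1` names are my own labels for illustration.

```python
import numpy as np
import pandas as pd

# Stand-in for one vote column (illustrative data, not the real file)
cvotes = pd.DataFrame({"v1": [0.0, 0.0, np.nan, 0.0, 1.0]})

# dummy_na=True adds an indicator column for NaN (abstain)
v1 = pd.get_dummies(cvotes["v1"], dummy_na=True, dtype=float)

# Levels come out sorted (0.0, 1.0) with the NaN indicator last,
# so relabel them in that order
v1.columns = ["nay_v1", "yea_v1", "abstain_v1"]

print(v1)
```

Every row then has exactly one 1.0 across the three columns, which matches the 100 / 010 / 001 pattern described in the question (in a different column order).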
